Update defaults of max-concurrent-file-group-rewrites to 5#6907
Conversation
| String MAX_CONCURRENT_FILE_GROUP_REWRITES = "max-concurrent-file-group-rewrites"; | ||
|
|
||
| int MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT = 1; | ||
| int MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT = 5; |
There was a problem hiding this comment.
I support the idea of changing this default as we ask our users to change it in 9/10 cases.
@RussellSpitzer, any thoughts on this as you wrote this part?
There was a problem hiding this comment.
I like 5 but what I'd really like us to do is check the size of our "file groups" then use that to determine what to set max concurrent groups too. For example if we see all file groups are less than say target file size, we set Max Conncurrent = Num Cores
There was a problem hiding this comment.
We would probably need to look at the scheduler implementation (e.g. FIFO) and total resources available in the cluster vs the amount of data we need to compact and partitions. I think it makes sense exploring but if it becomes too tricky, I'd just default it to something more reasonable.
There was a problem hiding this comment.
I support the idea of changing this default as we ask our users to change it in 9/10 cases.
True. Weekly once or twice, I am suggesting users configure it in Iceberg slack. Most of them doesn't read javadoc :(
I would love to see this default value getting changed.
+1 for this change.
There was a problem hiding this comment.
I'm sold for this right now, then something more complicated later
|
It seems now |
|
The test was failing as one(or more) commit was failing This was happening since the number of concurrent writes was more than the retry attempts. |
|
Thanks for this update @karuppayya ! |
| - code: "java.field.constantValueChanged" | ||
| old: "field org.apache.iceberg.actions.RewriteDataFiles.MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT" | ||
| new: "field org.apache.iceberg.actions.RewriteDataFiles.MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT" | ||
| justification: "In most casse,users need more parallelism and is a default recommendation\ |
| old: "method void org.apache.iceberg.io.DataWriter<T>::add(T)" | ||
| justification: "Removing deprecated method" | ||
| "1.1.0": | ||
| org.apache.iceberg:iceberg-api: |
There was a problem hiding this comment.
I think this should land under 1.2.0 and not 1.1.0 because the change is being introduced after 1.2.0 was released (most likely because the branch was created before 1.2.0 was released). @karuppayya could you please follow-up with that and move this under 1.2.0?
There was a problem hiding this comment.
Ah yeah I forgot we just missed the release
There was a problem hiding this comment.
I've opened #7223 to address that. @RussellSpitzer could you review please?
No description provided.