Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/08 19:01:29 UTC

[GitHub] [iceberg] kbendick commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

kbendick commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-915492512


   > @WinkerDu I definitely agree the bin-pack algorithm should be improved for v2 to consider the total size of insert & delete files. I think the `iterms-per-bin` proposed by your team is trying to resolve the unbalanced issue, but I'm concerned that it's hard to set the correct `iterms-per-bin` value for a given table in a real production environment, because `iterms-per-bin` still only controls the number of data files. We don't actually have a suitable approach to evaluate the cost of joining a data file with its delete records. I think we need a more accurate approach to decide which scan tasks should be dispatched to different tasks.
   
   I need to spend some time looking more closely at the test cases (and probably try this out on some V2 tables), but I share the concern that this config value might be really hard to determine in a production env. This is especially true for Flink users with CDC streams, for example: databases will often experience a burst of deletes / updates due to some cron schedule, and the table will then have an outsized number of delete files for a period of time (assuming it is partitioned by time as well).
   
   Wondering how we would go about picking a good number (or how often one would need to set a non-standard number, or change the number for individual sections of the table).
   
   That said, I'm also not averse to adding another argument to make bin packing more useful in the near-term while we figure out the best way to have a more "V2 native" algorithm / parameter set.
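
To make the trade-off concrete, here is a minimal sketch of sequential bin packing with an optional per-bin item cap. It is not Iceberg's `BinPacking` implementation and not the code in this PR; the function `pack`, the parameter `max_items_per_bin`, and the `(data_bytes, delete_bytes)` task tuples are all hypothetical names and numbers chosen for illustration.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def pack(
    items: Iterable[T],
    target_weight: int,
    weight_fn: Callable[[T], int],
    max_items_per_bin: Optional[int] = None,  # hypothetical cap, for illustration only
) -> List[List[T]]:
    """Fill one bin at a time, closing it when the next item would exceed
    the target weight or the bin already holds the maximum item count."""
    if max_items_per_bin is not None and max_items_per_bin < 1:
        raise ValueError("max_items_per_bin must be at least 1")

    bins: List[List[T]] = []
    current: List[T] = []
    current_weight = 0
    for item in items:
        weight = weight_fn(item)
        # An empty bin always accepts the item, even if it is overweight.
        too_heavy = bool(current) and current_weight + weight > target_weight
        too_many = max_items_per_bin is not None and len(current) >= max_items_per_bin
        if too_heavy or too_many:
            bins.append(current)
            current, current_weight = [], 0
        current.append(item)
        current_weight += weight
    if current:
        bins.append(current)
    return bins


# Hypothetical scan tasks as (data_bytes, delete_bytes); the last two carry
# large delete files, as might happen after a burst of CDC deletes.
tasks = [(100, 0), (100, 0), (100, 400), (100, 400)]

# Weighing by data size alone puts every task in a single bin.
by_data_only = pack(tasks, 512, lambda t: t[0])

# A count cap of 2 splits the bin, but the two halves cost 200 vs 1000.
capped = pack(tasks, 512, lambda t: t[0], max_items_per_bin=2)

# Weighing by data plus delete size separates the expensive tasks without a cap.
by_total = pack(tasks, 512, lambda t: t[0] + t[1])
```

In this toy input a cap of 2 still leaves one bin five times as expensive as the other, and a different cap would be needed for a different mix of delete files, which is the tuning difficulty raised above. Weighing by combined size sidesteps the cap here, but only under the assumption that data and delete bytes cost the same to process.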


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


