Posted to issues@iceberg.apache.org by "stevenzwu (via GitHub)" <gi...@apache.org> on 2023/10/16 23:38:08 UTC

Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

stevenzwu commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1765429213

   @rdblue here is a recap of the discussion on PR #7161: https://github.com/apache/iceberg/pull/7161#issuecomment-1761169778
   
   PR #7161 automatically applies the custom bucketing partitioner to distribute buckets across writer tasks in a balanced way. [It only looks at the bucket column](https://github.com/apache/iceberg/blob/main/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java#L528-L531) (ignoring other partition columns), on the assumption that the bucket column is the main thing we need to distribute.
   
   But a user reported that they have a partition spec like date, hour, minute, bucket(8). PR #7161 imposed a new default behavior that changed the distribution from a simple keyBy on tuples of all partition columns to a custom partitioner that uses only the bucket column. To me, the partition strategy is questionable. The bucket column is used here mainly to work around skewed data distribution across partition columns and the unbalanced value distribution from a simple keyBy.
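   
   To make the difference concrete, here is a rough sketch (not code from PR #7161) of how the two key choices route rows to writer subtasks for a spec like date, hour, minute, bucket(8). Everything in it is an assumption for illustration: `PartitionKey`, `writerForAllColumns`, and `writerForBucketOnly` are hypothetical names, plain `Objects.hash` stands in for Flink's real keyBy (which goes through key groups), and a simple modulo stands in for the custom bucket partitioner.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class BucketDistributionSketch {

    /** Hypothetical holder for one row's partition values under date, hour, minute, bucket(8). */
    static final class PartitionKey {
        final String date;
        final int hour;
        final int minute;
        final int bucket;

        PartitionKey(String date, int hour, int minute, int bucket) {
            this.date = date;
            this.hour = hour;
            this.minute = minute;
            this.bucket = bucket;
        }
    }

    // Simplified stand-in for keyBy on the tuple of all partition columns.
    // Assumption: plain hashing is used here; Flink's actual keyBy differs in detail.
    static int writerForAllColumns(PartitionKey key, int numWriters) {
        int hash = Objects.hash(key.date, key.hour, key.minute, key.bucket);
        return Math.floorMod(hash, numWriters);
    }

    // Simplified stand-in for a bucket-only partitioner.
    // date, hour, and minute are ignored; only the bucket id picks the writer.
    static int writerForBucketOnly(PartitionKey key, int numWriters) {
        return Math.floorMod(key.bucket, numWriters);
    }

    // Counts how many distinct writers receive the partitions that are active in one minute.
    static int distinctWriters(boolean bucketOnly, int numBuckets, int numWriters) {
        Set<Integer> writers = new HashSet<>();
        for (int bucket = 0; bucket < numBuckets; bucket++) {
            PartitionKey key = new PartitionKey("2023-10-16", 23, 38, bucket);
            writers.add(bucketOnly
                ? writerForBucketOnly(key, numWriters)
                : writerForAllColumns(key, numWriters));
        }
        return writers.size();
    }

    public static void main(String[] args) {
        int numBuckets = 8;
        int numWriters = 8;
        System.out.println("all columns: " + distinctWriters(false, numBuckets, numWriters) + " writers used");
        System.out.println("bucket only: " + distinctWriters(true, numBuckets, numWriters) + " writers used");
    }
}
```

   In this sketch the bucket-only routing never changes with date, hour, or minute, while the all-columns routing depends on every partition value, so the two defaults can spread the same rows across writers very differently.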
   
   In the end, I feel it is safer to revert the behavior change from PR #7161 and ask users to manually apply the custom partitioner for the bucket column. Previously, we were thinking about automatically enabling it when the partition spec has a bucketing column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

