Posted to issues@iceberg.apache.org by "stevenzwu (via GitHub)" <gi...@apache.org> on 2023/10/16 23:00:28 UTC

[I] Flink: revert the automatic custom partitioner for bucketing column with hash distribution [iceberg]

stevenzwu opened a new issue, #8847:
URL: https://github.com/apache/iceberg/issues/8847

   ### Apache Iceberg version
   
   1.4.0 (latest release)
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 🐞
   
   See the details in this comment: https://github.com/apache/iceberg/pull/7161#issuecomment-1763706777


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

Posted by "stevenzwu (via GitHub)" <gi...@apache.org>.
stevenzwu commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1765662039

   > We should be careful about default behavior changes 
   
   Agreed. This is on me: I wrongly assumed that the bucketing column is the only thing that needs to be distributed.




Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1765460065

   Thanks, @stevenzwu! I agree that reverting the behavior change makes the most sense. We should be careful about default behavior changes, and rolling back the change (but not the feature) sounds reasonable.




Re: [I] Flink: revert the automatic custom partitioner for bucketing column with hash distribution [iceberg]

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1765411414

   @stevenzwu, can you help us understand what the problem with this is and why it should be removed from the 1.4.1 release?




Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

Posted by "nastra (via GitHub)" <gi...@apache.org>.
nastra closed issue #8847: Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution
URL: https://github.com/apache/iceberg/issues/8847




Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

Posted by "stevenzwu (via GitHub)" <gi...@apache.org>.
stevenzwu commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1765429213

   @rdblue here is a recap of the discussion on PR #7161: https://github.com/apache/iceberg/pull/7161#issuecomment-1761169778
   
   PR #7161 automatically applies the custom bucketing partitioner to distribute buckets to writer tasks in a balanced way. [It only looks at the bucket column](https://github.com/apache/iceberg/blob/main/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java#L528-L531) (ignoring other partition columns), on the assumption that the bucket column is the main thing we need to distribute.
   
   But a user reported that they have a partition spec like date, hour, minute, bucket(8). PR #7161 imposed a new default behavior that changed the distribution from a simple keyBy on tuples of all partition columns to a custom partitioner that uses only the bucket column. To me, the partition strategy is questionable. The bucket column is used here mainly to work around the skewed data distribution across partition columns and the unbalanced value distribution from a simple keyBy.
   
   In the end, I feel it is safer to revert the behavior change from PR #7161 and ask users to manually apply the custom partitioner for the bucket column. Previously, we were thinking about automatically enabling it when the partition spec has a bucketing column.
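
For readers following the thread, the difference between the two routing keys described above can be modelled roughly as follows. This is only a sketch under stated assumptions: `RoutingSketch`, `routeByAllColumns`, and `routeByBucketOnly` are hypothetical names, the hash-and-modulo step is a stand-in for Flink's keyBy (which actually goes through key groups), and the bucket-only rule is a simplification rather than the partitioner from PR #7161.

```java
import java.util.Objects;

public class RoutingSketch {

    // Hypothetical model of a plain keyBy on the full partition tuple:
    // every partition column contributes to the writer choice.
    static int routeByAllColumns(String date, int hour, int minute, int bucket, int parallelism) {
        int hash = Objects.hash(date, hour, minute, bucket);
        return Math.floorMod(hash, parallelism);
    }

    // Hypothetical model of bucket-only routing:
    // date, hour, and minute are ignored when choosing a writer.
    static int routeByBucketOnly(String date, int hour, int minute, int bucket, int parallelism) {
        return Math.floorMod(bucket, parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4; // assumed number of writer tasks
        int numBuckets = 8;  // matches bucket(8) in the reported spec
        for (int minute = 0; minute < 3; minute++) {
            for (int bucket = 0; bucket < numBuckets; bucket++) {
                System.out.printf(
                    "date=2023-10-16 hour=23 minute=%d bucket=%d -> allColumns=%d bucketOnly=%d%n",
                    minute,
                    bucket,
                    routeByAllColumns("2023-10-16", 23, minute, bucket, parallelism),
                    routeByBucketOnly("2023-10-16", 23, minute, bucket, parallelism));
            }
        }
    }
}
```

In this model, the bucket-only column of the output repeats for every minute, while the all-columns assignment can change whenever any partition column changes. That is the shift in default behavior the comment refers to.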




Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

Posted by "nastra (via GitHub)" <gi...@apache.org>.
nastra commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1766722846

   Closing this as #8848 has been merged to main and I backported it to 1.4.x in https://github.com/apache/iceberg/pull/8858

