You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/10/11 01:15:57 UTC

[GitHub] [iceberg] aokolnychyi commented on pull request #2276: Core: Add option to combine tasks by partition

aokolnychyi commented on PR #2276:
URL: https://github.com/apache/iceberg/pull/2276#issuecomment-1273967116

   I think this is an essential PR to support storage-partitioned joins in Spark 3.3. It would be great to rebase it.
   
   Here is what I noted based on our discussion earlier:
   - We want an ability to avoid combining tasks across particular partition columns in our `Scan` API.
   - We do NOT want a table property as the config does not really belong there and it would affect other engines.
   - We want Spark to propagate join attributes in the future. Until then, we should have a reasonable default behavior.
   
   I think it is reasonable to not combine files across partitions for partitioned tables by default in Spark, hoping we can benefit from storage-partitioned joins. However, I worry the new behavior may cause performance regressions in some cases as we will generate more splits (even though we may not benefit from any join optimizations). Do we want to expose a way to force combining files across partitions (i.e. old behavior)? There are two ways to support that: either add a read option in Iceberg or try checking if storage-partitioned joins are enabled in Spark SQL (if not, we can safely combine). Since Spark will pass join attributes in the future, adding a read option does not seem to make a lot of sense. Any thoughts?
   
   cc @sunchao @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org