Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/01 06:55:57 UTC

[GitHub] [spark] c21 commented on pull request #36733: [SPARK-39344][SQL] Only disable bucketing when autoBucketedScan is enabled if bucket columns are not in scan output

c21 commented on PR #36733:
URL: https://github.com/apache/spark/pull/36733#issuecomment-1143188178

   @manuzhang - from my understanding, you want to introduce a feature that enforces the number of Spark tasks to be the same as the number of table buckets, even when the query does not read the bucket column(s). I agree with @cloud-fan in https://github.com/apache/spark/pull/27924#issuecomment-1139340835 that controlling the number of Spark tasks should not be a design goal of bucketed tables.
   
   If you really want to control the number of tasks, you can either tune `spark.sql.files.maxPartitionBytes` or add an extra shuffle via `repartition()`/`DISTRIBUTE BY`. I understand your concern per https://github.com/apache/spark/pull/27924#issuecomment-1139360593, but I am afraid we would be introducing a feature here that is not actually used by many other Spark users. To be honest, the requested feature does not seem popular based on my experience. My 2 cents: it might help to post on the Spark dev mailing list to gather more feedback from developers / users on whether they indeed have a similar requirement.
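   For reference, a minimal sketch of both workarounds (the table name `bucketed_tbl`, the `value`/`key` columns, and the target of 200 tasks are made-up placeholders for illustration):

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("control-task-count").getOrCreate()

   // Option 1: shrink the max input split size so the file scan yields more tasks.
   spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")
   val df = spark.table("bucketed_tbl").where("value > 0")

   // Option 2: add an explicit shuffle to pin the downstream task count exactly.
   val repartitioned = df.repartition(200)

   // SQL equivalent of option 2: DISTRIBUTE BY shuffles by `key` into
   // spark.sql.shuffle.partitions output partitions.
   spark.sql("SELECT * FROM bucketed_tbl DISTRIBUTE BY key")
   ```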


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org