Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/11 23:00:14 UTC

[GitHub] [spark] c21 commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable

c21 commented on pull request #28123:
URL: https://github.com/apache/spark/pull/28123#issuecomment-657143464

Thanks @imback82 for making this change!

Sorry for the late comment, just a few questions:

(1). Is there a reason why we don't cover ShuffledHashJoin as well? In production we see that people also use ShuffledHashJoin a lot for joining bucketed tables when one side is small.

(2). With this PR, the ordering property of coalesced bucket files is not preserved (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L317). The ordering could be preserved by reading all the sorted bucket files with a sort-merge style read. This would also help when reading multiple partitions of a bucketed table.
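To illustrate the sort-merge style read in (2), here is a minimal sketch in plain Python rather than Spark code. It assumes each bucket file is already sorted by the join key and is available as an iterable of rows; `read_coalesced_bucket` and the sample rows are hypothetical names and data made up for illustration, not anything in this PR.

```python
import heapq

def read_coalesced_bucket(sorted_bucket_files, key=lambda row: row[0]):
    """Lazily merge several individually sorted bucket files into one sorted stream.

    Plain concatenation would lose the ordering; a k-way merge keeps it
    without a full re-sort, because every input is already sorted by `key`.
    """
    return heapq.merge(*sorted_bucket_files, key=key)

# Hypothetical contents of two bucket files that get coalesced into one task.
bucket_1 = [(1, "a"), (9, "b"), (17, "c")]
bucket_5 = [(5, "d"), (13, "e")]

merged = list(read_coalesced_bucket([bucket_1, bucket_5]))
```

The merge only holds one pending row per input at a time, so the cost is a small heap per task instead of a sort over the whole coalesced bucket.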

(3). In production we see that coalescing can hurt parallelism if the number of buckets is too small. Another way to avoid the shuffle and sort is to split (divide) the table with fewer buckets. E.g. when joining t1 (8 buckets) and t2 (32 buckets), we can keep the number of tasks at 32, and each task reading t1 applies a run-time filter to keep only its portion of the table. The downside is that t1 is read more than once, by multiple tasks, but if t1 is not big it's a good trade-off for more parallelism (and may be better than shuffling t1 directly).
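A rough sketch of the divide idea in (3), again in plain Python and not Spark code. It assumes the bucket id is `hash(key) mod numBuckets` and that the larger bucket count is a multiple of the smaller one; the plain modulo in `bucket_id` is only a stand-in for the real bucketing hash, and `write_buckets` / `read_divided` are hypothetical helpers for illustration.

```python
def bucket_id(key, num_buckets):
    # Stand-in for the engine's bucketing hash (an assumption for illustration).
    return key % num_buckets

def write_buckets(keys, num_buckets):
    """Distribute keys into num_buckets lists, as a bucketed table write would."""
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[bucket_id(k, num_buckets)].append(k)
    return buckets

def read_divided(small_buckets, large_num_buckets, task_id):
    """Rows of the smaller table that task `task_id` (of large_num_buckets) should see.

    The task reads one whole bucket of the smaller table and filters it at
    run time down to the keys that hash to its own bucket id.
    """
    small_num_buckets = len(small_buckets)
    assert large_num_buckets % small_num_buckets == 0
    source = small_buckets[task_id % small_num_buckets]
    return [k for k in source if bucket_id(k, large_num_buckets) == task_id]

t1 = write_buckets(range(100), 8)         # table with fewer buckets
t2 = write_buckets(range(0, 100, 3), 32)  # table with more buckets

# 32 tasks: task i reads t2 bucket i as-is and a filtered slice of t1 bucket i % 8.
per_task = [read_divided(t1, 32, i) for i in range(32)]
```

In this sketch each t1 bucket is read by 32 / 8 = 4 tasks, which is the extra read cost mentioned above, while every t1 row still ends up in exactly one task.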

We have been running the above 3 features for years at Facebook (https://databricks.com/session_eu19/spark-sql-bucketing-at-facebook), and I would like to make or help with the follow-up changes if this sounds reasonable to everyone. cc @imback82, @cloud-fan, @viirya, @gatorsmile and @sameeragarwal.
