Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/02 04:38:32 UTC

[GitHub] [spark] wangyum opened a new pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

wangyum opened a new pull request #29624:
URL: https://github.com/apache/spark/pull/29624


    This backports #29612 to branch-3.0. Original PR description:
   
   ### What changes were proposed in this pull request?
   
   A bucket join should still avoid shuffling the already-bucketed side when `spark.sql.shuffle.partitions` is larger than the number of buckets, for example:
   ```scala
   spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1")
   spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2")
   sql("set spark.sql.shuffle.partitions=600")
   sql("set spark.sql.autoBroadcastJoinThreshold=-1")
   sql("select * from t1 join t2 on t1.id = t2.id").explain()
   ```
   
   Before this PR (both sides are shuffled to 600 partitions):
   ```
   == Physical Plan ==
   *(5) SortMergeJoin [id#26L], [id#27L], Inner
   :- *(2) Sort [id#26L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#26L, 600), true
   :     +- *(1) Filter isnotnull(id#26L)
   :        +- *(1) ColumnarToRow
   :           +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432
   +- *(4) Sort [id#27L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#27L, 600), true
         +- *(3) Filter isnotnull(id#27L)
            +- *(3) ColumnarToRow
               +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34
   ```
   
   After this PR (only `t2` is shuffled, to the 432 partitions of `t1`):
   ```
   == Physical Plan ==
   *(4) SortMergeJoin [id#26L], [id#27L], Inner
   :- *(1) Sort [id#26L ASC NULLS FIRST], false, 0
   :  +- *(1) Filter isnotnull(id#26L)
   :     +- *(1) ColumnarToRow
   :        +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432
   +- *(3) Sort [id#27L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#27L, 432), true
         +- *(2) Filter isnotnull(id#27L)
            +- *(2) ColumnarToRow
               +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34
   ```
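
   The difference between the two plans can be sketched with a small model. This is only an illustration inferred from the plans above, not the code in Spark's planner: `JoinSide` and `planShuffles` are hypothetical names, and the rule "use the largest existing partition count and shuffle only the sides that differ" is an assumption that happens to reproduce the "After" plan.

   ```scala
   object BucketJoinSketch {
     // Hypothetical description of one join input: the table name and the
     // number of partitions its scan already produces (its bucket count here).
     final case class JoinSide(name: String, numPartitions: Int)

     // Assumed rule (not Spark's actual implementation): the join runs with the
     // largest partition count among its inputs, and only inputs with a
     // different count need an Exchange. spark.sql.shuffle.partitions is
     // deliberately not an input to this decision.
     def planShuffles(sides: Seq[JoinSide]): (Int, Seq[String]) = {
       val target = sides.map(_.numPartitions).max
       val toShuffle = sides.filter(_.numPartitions != target).map(_.name)
       (target, toShuffle)
     }

     def main(args: Array[String]): Unit = {
       val (target, shuffled) =
         planShuffles(Seq(JoinSide("t1", 432), JoinSide("t2", 34)))
       // t1 keeps its 432 buckets; only t2 is repartitioned to 432.
       println(s"target=$target, shuffled=${shuffled.mkString(",")}")
     }
   }
   ```

   With the two tables from the example, this model shuffles only `t2` and leaves `t1` on its 432 buckets, which matches the single `Exchange hashpartitioning(id#27L, 432)` in the "After" plan.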
   
   ### Why are the changes needed?
   
   Spark 2.4 supports this.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Unit test.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685291943








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685402644








[GitHub] [spark] AmplabJenkins commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685402644








[GitHub] [spark] SparkQA commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685401777


   **[Test build #128191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128191/testReport)** for PR 29624 at commit [`7273824`](https://github.com/apache/spark/commit/7273824e6fbb0dea987d8f9d13822c9f6483db62).




[GitHub] [spark] AmplabJenkins commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685682981








[GitHub] [spark] AmplabJenkins commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685391545








[GitHub] [spark] SparkQA commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685390612


   **[Test build #128182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128182/testReport)** for PR 29624 at commit [`7273824`](https://github.com/apache/spark/commit/7273824e6fbb0dea987d8f9d13822c9f6483db62).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685291943








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685682981








[GitHub] [spark] wangyum closed pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #29624:
URL: https://github.com/apache/spark/pull/29624


   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685391545


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685401777


   **[Test build #128191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128191/testReport)** for PR 29624 at commit [`7273824`](https://github.com/apache/spark/commit/7273824e6fbb0dea987d8f9d13822c9f6483db62).




[GitHub] [spark] wangyum commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-686587892


   cc @cloud-fan 




[GitHub] [spark] SparkQA commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685291601


   **[Test build #128182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128182/testReport)** for PR 29624 at commit [`7273824`](https://github.com/apache/spark/commit/7273824e6fbb0dea987d8f9d13822c9f6483db62).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685391575


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128182/
   Test FAILed.




[GitHub] [spark] wangyum commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-686824158


   Merged to branch-3.0.




[GitHub] [spark] SparkQA commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685681644


   **[Test build #128191 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128191/testReport)** for PR 29624 at commit [`7273824`](https://github.com/apache/spark/commit/7273824e6fbb0dea987d8f9d13822c9f6483db62).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685291601


   **[Test build #128182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128182/testReport)** for PR 29624 at commit [`7273824`](https://github.com/apache/spark/commit/7273824e6fbb0dea987d8f9d13822c9f6483db62).




[GitHub] [spark] wangyum commented on pull request #29624: [SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29624:
URL: https://github.com/apache/spark/pull/29624#issuecomment-685393762


   retest this please

