Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/09 12:59:41 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

zhengruifeng opened a new pull request #34850:
URL: https://github.com/apache/spark/pull/34850


   ### What changes were proposed in this pull request?
   
   Deduplicate the right side of left-semi and left-anti joins by wrapping it in an `Aggregate` that groups on the right side's output columns.
   
   
   ### Why are the changes needed?
   
   1. Reduce the amount of data shuffled on the right side.
   2. Improve the chance that the right side can be broadcast.
   3. Resolve skewed keys on the right side.
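
The rewrite is expected to be safe because a left-semi or left-anti join only asks whether a match exists on the right side, so duplicate right-side rows should not change the result. A minimal sketch in plain Scala collections (not Spark): `leftSemi` and `leftAnti` are illustrative helpers, only an equality match on the whole row is modeled, and SQL null semantics are ignored.

```scala
object SemiAntiDedupSketch {
  // Keep each left row that has at least one equal row on the right.
  def leftSemi[A](left: Seq[A], right: Seq[A]): Seq[A] =
    left.filter(l => right.contains(l))

  // Keep each left row that has no equal row on the right.
  def leftAnti[A](left: Seq[A], right: Seq[A]): Seq[A] =
    left.filterNot(l => right.contains(l))

  def main(args: Array[String]): Unit = {
    val left  = Seq(1, 2, 2, 3, 4)
    val right = Seq(2, 2, 2, 4, 4, 5) // heavily duplicated keys

    // Deduplicating the right side leaves both results unchanged.
    assert(leftSemi(left, right) == leftSemi(left, right.distinct))
    assert(leftAnti(left, right) == leftAnti(left, right.distinct))

    println(leftSemi(left, right.distinct)) // List(2, 2, 4)
    println(leftAnti(left, right.distinct)) // List(1, 3)
  }
}
```

Note that duplicates on the left side are preserved in both cases; only the right side is interchangeable with its distinct form.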
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added test suites.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34850:
URL: https://github.com/apache/spark/pull/34850#issuecomment-989830685


   **[Test build #146038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146038/testReport)** for PR 34850 at commit [`49dd1ad`](https://github.com/apache/spark/commit/49dd1ad5cb9021dc8d13a25b4e123fa18a2d0503).




[GitHub] [spark] wangyum commented on a change in pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #34850:
URL: https://github.com/apache/spark/pull/34850#discussion_r765773928



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -122,6 +122,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
         RewriteCorrelatedScalarSubquery,
         RewriteLateralSubquery,
         EliminateSerialization,
+        DeduplicateLeftSemiLeftAntiRightSide,

Review comment:
       This rule should run after `RewriteSubquery` to cover more cases, such as semi/anti joins that only appear once subqueries have been rewritten.
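
To illustrate why rule order matters here, the following is a toy model, not Catalyst code. The `Plan` types and the two functions are hypothetical stand-ins: `rewriteSubquery` plays the role of the subquery rewrite that turns an `IN`-style predicate into a semi join, and `dedupRight` plays the role of the proposed rule.

```scala
object RuleOrderSketch {
  sealed trait Plan
  case class Table(name: String) extends Plan
  case class Distinct(child: Plan) extends Plan
  case class SemiJoin(left: Plan, right: Plan) extends Plan
  // Stand-in for a predicate subquery such as `WHERE x IN (SELECT ...)`.
  case class InSubquery(outer: Plan, subquery: Plan) extends Plan

  // Toy subquery rewrite: every InSubquery becomes a SemiJoin.
  def rewriteSubquery(p: Plan): Plan = p match {
    case InSubquery(o, s) => SemiJoin(rewriteSubquery(o), rewriteSubquery(s))
    case SemiJoin(l, r)   => SemiJoin(rewriteSubquery(l), rewriteSubquery(r))
    case Distinct(c)      => Distinct(rewriteSubquery(c))
    case t: Table         => t
  }

  // Toy dedup rule: wrap the right child of a SemiJoin unless already distinct.
  def dedupRight(p: Plan): Plan = p match {
    case SemiJoin(l, r) =>
      val newRight = dedupRight(r) match {
        case d: Distinct => d
        case other       => Distinct(other)
      }
      SemiJoin(dedupRight(l), newRight)
    case InSubquery(o, s) => InSubquery(dedupRight(o), dedupRight(s))
    case Distinct(c)      => Distinct(dedupRight(c))
    case t: Table         => t
  }

  def main(args: Array[String]): Unit = {
    val query = InSubquery(Table("t1"), Table("t2"))

    // Dedup first: no SemiJoin exists yet, so the rule has nothing to match.
    val dedupFirst = rewriteSubquery(dedupRight(query))
    // Dedup last: the join created by the rewrite is covered.
    val dedupLast = dedupRight(rewriteSubquery(query))

    assert(dedupFirst == SemiJoin(Table("t1"), Table("t2")))
    assert(dedupLast == SemiJoin(Table("t1"), Distinct(Table("t2"))))
  }
}
```

In this sketch the same two rules produce different plans depending on order, which is the effect the review comment appears to be pointing at.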








[GitHub] [spark] SparkQA commented on pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34850:
URL: https://github.com/apache/spark/pull/34850#issuecomment-989921408


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50513/
   




[GitHub] [spark] wangyum commented on a change in pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #34850:
URL: https://github.com/apache/spark/pull/34850#discussion_r765776375



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -2148,6 +2149,17 @@ object RewriteIntersectAll extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Deduplicate the right side of left-semi join and left-anti join.
+ */
+object DeduplicateLeftSemiLeftAntiRightSide extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
+    _.containsPattern(LEFT_SEMI_OR_ANTI_JOIN), ruleId) {
+    case join @ Join(_, right, LeftSemiOrAnti(_), _, _) if !right.isInstanceOf[Aggregate] =>
+      join.copy(right = Aggregate(right.output, right.output, right))

Review comment:
       Deduplication is not always beneficial. This is my initial PR: https://github.com/apache/spark/pull/33465/commits/b5599010ba969ba4cc3a2ce85549fe226b75ae65
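
One way to read this comment: the added `Aggregate` is extra work, and it only pays off when the right side actually contains many duplicates. The guard below is a hypothetical sketch and is not taken from either PR; `RightSideStats`, `shouldDeduplicate`, and the default threshold of 0.5 are assumptions made purely for illustration.

```scala
object DedupBenefitSketch {
  // Hypothetical statistics about the right side of the join.
  final case class RightSideStats(rowCount: Long, distinctKeys: Long)

  // Deduplicate only when the distinct keys are a small fraction of the rows.
  def shouldDeduplicate(stats: RightSideStats, maxDistinctRatio: Double = 0.5): Boolean =
    stats.rowCount > 0 &&
      stats.distinctKeys.toDouble / stats.rowCount <= maxDistinctRatio

  def main(args: Array[String]): Unit = {
    // Many duplicates: 100 distinct keys over 10,000,000 rows.
    println(shouldDeduplicate(RightSideStats(10000000L, 100L)))
    // Nearly unique right side: aggregating would remove almost nothing.
    println(shouldDeduplicate(RightSideStats(10000000L, 9999999L)))
  }
}
```

A real optimizer rule would need reliable statistics to make this decision, which is not always the case; the sketch only shows the shape of such a check.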








[GitHub] [spark] AmplabJenkins commented on pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34850:
URL: https://github.com/apache/spark/pull/34850#issuecomment-989937582


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146038/
   




[GitHub] [spark] zhengruifeng commented on pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34850:
URL: https://github.com/apache/spark/pull/34850#issuecomment-989840218


   test case:
   
   ```
   spark.range(0, 10000, 1, 10).selectExpr("id % 1000 as key1", "id % 3 as value1").createOrReplaceTempView("table1")
   
   spark.range(0, 10000000, 1, 10).selectExpr("id % 100 as key2", "id as value2").createOrReplaceTempView("table2")
   
   spark.sql("SELECT key1 FROM table1 LEFT ANTI JOIN table2 ON key1 = key2").write.mode("overwrite").parquet("/tmp/tmp1.parquet")
   
   ```
   
   master:
   ```
   == Physical Plan ==
   Execute InsertIntoHadoopFsRelationCommand (22)
   +- AdaptiveSparkPlan (21)
      +- == Final Plan ==
         * Project (14)
         +- * SortMergeJoin LeftAnti (13)
            :- * Sort (5)
            :  +- AQEShuffleRead (4)
            :     +- ShuffleQueryStage (3)
            :        +- Exchange (2)
            :           +- * Range (1)
            +- * Sort (12)
               +- AQEShuffleRead (11)
                  +- ShuffleQueryStage (10)
                     +- Exchange (9)
                        +- * Project (8)
                           +- * Filter (7)
                              +- * Range (6)
   
   
   
   ```
   
   
   
   this pr:
   ```
   == Physical Plan ==
   Execute InsertIntoHadoopFsRelationCommand (25)
   +- AdaptiveSparkPlan (24)
      +- == Final Plan ==
         * Project (16)
         +- * BroadcastHashJoin LeftAnti BuildRight (15)
            :- AQEShuffleRead (4)
            :  +- ShuffleQueryStage (3)
            :     +- Exchange (2)
            :        +- * Range (1)
            +- BroadcastQueryStage (14)
               +- BroadcastExchange (13)
                  +- * HashAggregate (12)
                     +- AQEShuffleRead (11)
                        +- ShuffleQueryStage (10)
                           +- Exchange (9)
                              +- * HashAggregate (8)
                                 +- * Project (7)
                                    +- * Filter (6)
                                       +- * Range (5)
   ```
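
As a back-of-the-envelope check on the test case (plain Scala, not Spark): `key2 = id % 100` takes only 100 distinct values across 10,000,000 rows, so a right side deduplicated on the join key would shrink to 100 rows, which may be why the plan for this PR ends in a `BroadcastHashJoin`. The object and value names are illustrative, and the sketch assumes the deduplication effectively groups on `key2` alone.

```scala
object TestCaseArithmetic {
  // key1 = id % 1000 over 10,000 rows (table1 in the test case).
  val key1: Seq[Long] = (0L until 10000L).map(_ % 1000)

  // Distinct values of key2 = id % 100 over 10,000,000 rows (table2).
  val distinctKey2: Set[Long] =
    (0L until 10000000L).iterator.map(_ % 100).toSet

  // Rows that survive the LEFT ANTI JOIN: key1 values with no matching key2.
  val antiJoinRows: Int = key1.count(k => !distinctKey2.contains(k))

  def main(args: Array[String]): Unit = {
    println(distinctKey2.size) // 100
    println(antiJoinRows)      // 9000
  }
}
```

The 9000 surviving rows are the 900 key values from 100 to 999, each of which occurs 10 times in `table1`.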
   






[GitHub] [spark] SparkQA commented on pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34850:
URL: https://github.com/apache/spark/pull/34850#issuecomment-989937176


   **[Test build #146038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146038/testReport)** for PR 34850 at commit [`49dd1ad`](https://github.com/apache/spark/commit/49dd1ad5cb9021dc8d13a25b4e123fa18a2d0503).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34850:
URL: https://github.com/apache/spark/pull/34850#issuecomment-989868213


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50513/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34850: [SPARK-37597][WIP] Deduplicate the right side of left-semi join and left-anti join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34850:
URL: https://github.com/apache/spark/pull/34850#issuecomment-989928809


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50513/
   

