Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/22 13:36:00 UTC

[GitHub] [spark] wangyum opened a new pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

wangyum opened a new pull request #31113:
URL: https://github.com/apache/spark/pull/31113


   ### What changes were proposed in this pull request?
   
   This PR adds a new config (`spark.sql.optimizer.pushdownDistinctInSetOperations`) to support pushing down DISTINCT in set operations (`INTERSECT`/`EXCEPT`).
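
A minimal sketch of the idea, written in plain Python rather than Spark, assuming the rewrite shown in the `Optimizer.scala` doc comment of this PR: `INTERSECT` can be evaluated either as a left semi join followed by DISTINCT, or as a left semi join over children that were deduplicated first. The helper names (`distinct`, `left_semi_join`, and so on) are hypothetical and only model the semantics. Python's `None == None` stands in for the null-safe `<=>` comparison.

```python
def distinct(rows):
    """Drop duplicate rows while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def left_semi_join(left, right):
    """Keep every left row that has at least one equal row on the right."""
    right_rows = set(right)
    return [row for row in left if row in right_rows]


def intersect_distinct_on_top(left, right):
    # Models: SELECT DISTINCT ... FROM left LEFT SEMI JOIN right
    return distinct(left_semi_join(left, right))


def intersect_distinct_pushed_down(left, right):
    # Models: (SELECT DISTINCT ... FROM left) LEFT SEMI JOIN (SELECT DISTINCT ... FROM right)
    return left_semi_join(distinct(left), distinct(right))


# Illustrative data only; not taken from the benchmark.
tab1 = [(1, "a"), (1, "a"), (2, "b"), (3, None), (3, None)]
tab2 = [(1, "a"), (3, None), (3, None), (4, "d")]

before = intersect_distinct_on_top(tab1, tab2)
after = intersect_distinct_pushed_down(tab1, tab2)
print(before == after)
```

Both forms return the same rows. The pushed-down form feeds fewer rows into the join when the inputs contain many duplicates, which is the case this change targets.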
   
   ### Why are the changes needed?
   
   Improve `INTERSECT`/`EXCEPT` performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit tests and benchmark tests.
   
   
   SQL | Before this PR (seconds) | After this PR (seconds)
   -- | -- | --
   q8 | 37 | 40
   q14a | 600 | 84
   q14b | 600 | 72
   q38 | 60 | 31
   q87 | 59 | 36
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757484424


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133892/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824934803


   **[Test build #137814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137814/testReport)** for PR 31113 at commit [`941a59d`](https://github.com/apache/spark/commit/941a59daa138a94ce8ace2f639dd0698d08294fa).




[GitHub] [spark] tanelk commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757490576


   Does it work for `ReplaceExceptWithAntiJoin` also?




[GitHub] [spark] wangyum edited a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757554974


   > Does it work for `ReplaceExceptWithAntiJoin` also?
   
   Yes, it should work for `ReplaceExceptWithAntiJoin`, but it doesn't improve the TPC-DS queries, so I didn't add it to this PR. We can add it in a following PR.
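
A minimal sketch of the `EXCEPT` case, written in plain Python rather than Spark, assuming `EXCEPT` is evaluated as a left anti join with DISTINCT applied either on top or on both children. The helper names are hypothetical and only model the semantics. Python's `None == None` stands in for the null-safe `<=>` comparison.

```python
def distinct(rows):
    """Drop duplicate rows while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def left_anti_join(left, right):
    """Keep every left row that has no equal row on the right."""
    right_rows = set(right)
    return [row for row in left if row not in right_rows]


def except_distinct_on_top(left, right):
    # Models: SELECT DISTINCT ... FROM left LEFT ANTI JOIN right
    return distinct(left_anti_join(left, right))


def except_distinct_pushed_down(left, right):
    # Models: (SELECT DISTINCT ... FROM left) LEFT ANTI JOIN (SELECT DISTINCT ... FROM right)
    return left_anti_join(distinct(left), distinct(right))


# Illustrative data only.
tab1 = [(1,), (1,), (2,), (2,), (3,), (None,), (None,)]
tab2 = [(2,), (2,), (4,)]

except_before = except_distinct_on_top(tab1, tab2)
except_after = except_distinct_pushed_down(tab1, tab2)
print(except_before == except_after)
```

The two forms agree on this input. Whether the pushed-down form is faster depends on how many duplicates the children carry, which fits the observation above that it does not help the TPC-DS queries.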




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757629585


   **[Test build #133901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133901/testReport)** for PR 31113 at commit [`ab52823`](https://github.com/apache/spark/commit/ab52823488b0b84cc0548fc021227fb025ef1288).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r588905792



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
+ *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 ==>
+ *   SELECT a1, a2 FROM
+ *     (SELECT DISTINCT a1, a2 FROM Tab1)
+ *   LEFT SEMI JOIN
+ *     (SELECT DISTINCT b1, b2 FROM Tab2)

Review comment:
       Do you have any reference from the other DBMSs?






[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824934803


   **[Test build #137814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137814/testReport)** for PR 31113 at commit [`941a59d`](https://github.com/apache/spark/commit/941a59daa138a94ce8ace2f639dd0698d08294fa).




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-798765920


   **[Test build #136030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136030/testReport)** for PR 31113 at commit [`6f3b87a`](https://github.com/apache/spark/commit/6f3b87a3bba6f7a827674b87b59e819af7419112).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] wangyum commented on a change in pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r589452404



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       Another similar case.
   Original SQL:
   ```sql
   SELECT DISTINCT t.at_prod_ref_id, d.week_beg_dt
   FROM   table1 t
          JOIN table2 d
            ON d.dt BETWEEN t.start_dt AND t.end_dt
   WHERE  d.dt BETWEEN '2019-01-01' AND CURRENT_DATE()
          AND cid IN ( 0, 3, 15, 77, 71, 101, 186, 100 ); 
   ```
   Pushing DISTINCT down through `BroadcastNestedLoopJoin`:
   ```sql
   SELECT DISTINCT t.at_prod_ref_id, d.week_beg_dt FROM
   (  SELECT DISTINCT at_prod_ref_id, start_dt, end_dt
      FROM table1 WHERE cid IN ( 0, 3, 15, 77, 71, 101, 186, 100 )
   ) t
   JOIN (
     SELECT DISTINCT dt, week_beg_dt FROM table2 WHERE dt BETWEEN '2019-01-01' AND current_date()
   ) d ON d.dt BETWEEN t.start_dt AND t.end_dt;
   ```
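
A minimal sketch, in plain Python rather than Spark, of why the two queries above return the same rows: deduplicating each side on the columns the join and projection need cannot change the final DISTINCT result, and the outer DISTINCT must stay in place. The table contents, the fixed upper date bound used in place of `CURRENT_DATE()`, and the function names are all assumptions made for illustration. Dates are ISO strings, so string comparison orders them correctly.

```python
CIDS = {0, 3, 15, 77, 71, 101, 186, 100}
LOWER, UPPER = "2019-01-01", "2019-12-31"  # UPPER is a stand-in for CURRENT_DATE()


def original_query(table1, table2):
    """Join first, then apply DISTINCT to the projected columns."""
    out = set()
    for prod, start_dt, end_dt, cid in table1:
        if cid not in CIDS:
            continue
        for dt, week_beg_dt in table2:
            if LOWER <= dt <= UPPER and start_dt <= dt <= end_dt:
                out.add((prod, week_beg_dt))
    return out


def pushed_down_query(table1, table2):
    """Apply DISTINCT to both children, join, then apply DISTINCT again."""
    t = {(prod, s, e) for prod, s, e, cid in table1 if cid in CIDS}
    d = {(dt, week) for dt, week in table2 if LOWER <= dt <= UPPER}
    return {(prod, week) for prod, s, e in t for dt, week in d if s <= dt <= e}


# Columns: at_prod_ref_id, start_dt, end_dt, cid
table1 = [
    ("p1", "2019-01-01", "2019-01-10", 3),
    ("p1", "2019-01-01", "2019-01-10", 3),    # duplicate row
    ("p2", "2019-02-01", "2019-02-05", 15),
    ("p3", "2019-01-01", "2019-12-31", 999),  # removed by the cid filter
]
# Columns: dt, week_beg_dt
table2 = [
    ("2019-01-02", "2018-12-30"),
    ("2019-01-02", "2018-12-30"),  # duplicate row
    ("2019-01-08", "2019-01-06"),
    ("2019-02-03", "2019-02-03"),
    ("2018-12-31", "2018-12-30"),  # removed by the date filter
]

result_original = original_query(table1, table2)
result_pushed = pushed_down_query(table1, table2)
print(result_original == result_pushed)
```

In the pushed-down form the nested loop compares only the distinct rows of each side, which is where the saving would come from when the inputs contain many duplicates.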
   
   ![image](https://user-images.githubusercontent.com/5399861/110332622-882d9500-805b-11eb-9b6f-31f17b109fdb.png)
   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800334285


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40700/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800319230


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136112/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757502638


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38483/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-825151837


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137814/
   




[GitHub] [spark] dongjoon-hyun commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-791991221


   Sorry for being late, @wangyum. Could you rebase this PR onto master?




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824933386








[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757496095


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38483/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757493938








[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757578992


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38490/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757512417


   **[Test build #133893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133893/testReport)** for PR 31113 at commit [`a42d9e0`](https://github.com/apache/spark/commit/a42d9e047dffd82d93d90b0031d5f0878fd6c55f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800336988


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136118/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757511891


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38484/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800160523


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40694/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-798514862


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40614/
   




[GitHub] [spark] wangyum commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-798414942


   @dongjoon-hyun @cloud-fan Impala supports this feature: https://github.com/apache/impala/commit/827070b473c02da480f3a9d77c59f7031f9070c2




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824977523


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42344/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757493938








[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757476738


   **[Test build #133892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133892/testReport)** for PR 31113 at commit [`1dd0c1c`](https://github.com/apache/spark/commit/1dd0c1ceb9b5377eaed7372a94406528fcd82472).




[GitHub] [spark] wangyum commented on a change in pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r593754471



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
+ *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 ==>
+ *   SELECT a1, a2 FROM
+ *     (SELECT DISTINCT a1, a2 FROM Tab1)
+ *   LEFT SEMI JOIN
+ *     (SELECT DISTINCT b1, b2 FROM Tab2)

Review comment:
       Impala supports this feature: apache/impala@827070b






[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757486006


   **[Test build #133894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133894/testReport)** for PR 31113 at commit [`3a65541`](https://github.com/apache/spark/commit/3a65541da7876016f1c9d10069f176f27f387d01).




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757512495


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133893/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757506748


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38484/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-825039496


   **[Test build #137812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137812/testReport)** for PR 31113 at commit [`e226e2c`](https://github.com/apache/spark/commit/e226e2ceea7b648a1864c7ff1ba1e67eee83789b).
    * This patch passes all tests.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757485308


   **[Test build #133893 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133893/testReport)** for PR 31113 at commit [`a42d9e0`](https://github.com/apache/spark/commit/a42d9e047dffd82d93d90b0031d5f0878fd6c55f).




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757526460


   **[Test build #133894 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133894/testReport)** for PR 31113 at commit [`3a65541`](https://github.com/apache/spark/commit/3a65541da7876016f1c9d10069f176f27f387d01).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757489049


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38481/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757494922


   **[Test build #133895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133895/testReport)** for PR 31113 at commit [`bfc35f4`](https://github.com/apache/spark/commit/bfc35f4006020b86c98f6148542748fb1cc984cd).




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824933452


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42342/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757529521


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133894/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-798514862


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40614/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757538061


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133895/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757502638


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38483/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-798772747


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136030/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757484424


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133892/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757511891


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38484/
   




[GitHub] [spark] wangyum commented on a change in pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r595755856



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       Do you mean we should push down DISTINCT only when CBO is enabled, all column stats exist, and there are duplicate values?
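
The rewrite quoted in the doc comment above (INTERSECT becomes a left-semi join followed by DISTINCT) can be compared with a variant that deduplicates each child before the join. The following is only a minimal sketch on plain Scala collections, not the Catalyst rule from this PR; `IntersectSketch`, `Row`, and both method names are hypothetical and exist only for illustration.

```scala
// Hypothetical sketch: models INTERSECT on in-memory rows, not on Spark plans.
object IntersectSketch {
  type Row = (Int, String)

  // Shape described in the existing doc comment:
  // left-semi join first, then DISTINCT on the result.
  def semiJoinThenDistinct(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val rightKeys = right.toSet
    left.filter(rightKeys.contains).distinct
  }

  // Variant with DISTINCT applied to both children before the join,
  // so the join only sees deduplicated inputs.
  def distinctThenSemiJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val rightKeys = right.distinct.toSet
    left.distinct.filter(rightKeys.contains)
  }
}
```

Both functions return the same rows for any input; the difference is only how many rows reach the join, which is why the benefit depends on how many duplicates the children actually contain.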






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824977523


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42344/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800284692


   **[Test build #136112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136112/testReport)** for PR 31113 at commit [`f55d0eb`](https://github.com/apache/spark/commit/f55d0ebb762d95a731f0b8e12d75f6f96b2ec48c).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757592967


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38490/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757479385


   **[Test build #133892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133892/testReport)** for PR 31113 at commit [`1dd0c1c`](https://github.com/apache/spark/commit/1dd0c1ceb9b5377eaed7372a94406528fcd82472).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757584148


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38490/
   




[GitHub] [spark] wangyum commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757554974


   > Does it work for `ReplaceExceptWithAntiJoin` also?
   
   Yes, it should work for `ReplaceExceptWithAntiJoin`, but it doesn't improve the TPC-DS queries, so I didn't add it to the PR.
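
For reference, the EXCEPT case can be sketched the same way: `ReplaceExceptWithAntiJoin` rewrites EXCEPT into a left-anti join followed by DISTINCT, and the DISTINCT could in principle be applied to the left child first. This is a rough illustration on plain Scala collections under that assumption, not code from this PR; `ExceptSketch`, `Row`, and the method names are hypothetical.

```scala
// Hypothetical sketch: models EXCEPT on in-memory rows, not on Spark plans.
object ExceptSketch {
  type Row = (Int, String)

  // Left-anti join first (keep left rows with no match on the right),
  // then DISTINCT on the result.
  def antiJoinThenDistinct(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val rightKeys = right.toSet
    left.filterNot(rightKeys.contains).distinct
  }

  // Variant with DISTINCT applied to the left child before the anti join.
  def distinctThenAntiJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val rightKeys = right.toSet
    left.distinct.filterNot(rightKeys.contains)
  }
}
```

The two orderings produce the same rows; whether the second one is faster depends on the data, which is consistent with the observation above that it did not help the TPC-DS queries.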




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824976218








[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-825041075


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137812/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757491894








[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757572700


   **[Test build #133901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133901/testReport)** for PR 31113 at commit [`ab52823`](https://github.com/apache/spark/commit/ab52823488b0b84cc0548fc021227fb025ef1288).




[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757485308


   **[Test build #133893 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133893/testReport)** for PR 31113 at commit [`a42d9e0`](https://github.com/apache/spark/commit/a42d9e047dffd82d93d90b0031d5f0878fd6c55f).




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757572700


   **[Test build #133901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133901/testReport)** for PR 31113 at commit [`ab52823`](https://github.com/apache/spark/commit/ab52823488b0b84cc0548fc021227fb025ef1288).




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757501694


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38484/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800160523


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40694/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800319230


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136112/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800152464


   **[Test build #136112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136112/testReport)** for PR 31113 at commit [`f55d0eb`](https://github.com/apache/spark/commit/f55d0ebb762d95a731f0b8e12d75f6f96b2ec48c).




[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824876818


   **[Test build #137812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137812/testReport)** for PR 31113 at commit [`e226e2c`](https://github.com/apache/spark/commit/e226e2ceea7b648a1864c7ff1ba1e67eee83789b).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757512495


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133893/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757592967


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38490/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-798772747


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136030/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800334285


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40700/
   




[GitHub] [spark] maropu commented on a change in pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r595261690



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       Can't we use column stats (`distinctCount`) to decide whether the optimizer should push it down?
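
One way such a stats-based check might look, as a minimal sketch and not code from this PR. `ChildStats`, `estimatedDistinctRows`, `shouldPushDown`, and the `maxRatio` threshold are all hypothetical names. The sketch bounds the number of distinct rows by the product of the per-column distinct counts (capped at the row count) and pushes DISTINCT down only when that bound is a small fraction of the input. When any statistic is missing, it declines to push down.

```scala
object DistinctPushdownHeuristic {
  // Hypothetical stand-in for the statistics an optimizer might hold for one join child.
  final case class ChildStats(
      rowCount: Option[BigInt],
      distinctCounts: Seq[Option[BigInt]])

  // Upper bound on the number of distinct rows: the product of the per-column
  // distinct counts, capped at the row count. None if any statistic is missing.
  def estimatedDistinctRows(stats: ChildStats): Option[BigInt] =
    for {
      rows <- stats.rowCount
      perColumn <-
        if (stats.distinctCounts.nonEmpty && stats.distinctCounts.forall(_.isDefined))
          Some(stats.distinctCounts.flatten)
        else None
    } yield perColumn.product.min(rows)

  // Push DISTINCT down only if it is expected to shrink the child to at most
  // `maxRatio` of its original size. The default threshold is an arbitrary assumption.
  def shouldPushDown(stats: ChildStats, maxRatio: Double = 0.5): Boolean =
    (estimatedDistinctRows(stats), stats.rowCount) match {
      case (Some(distinct), Some(rows)) if rows > 0 =>
        BigDecimal(distinct) / BigDecimal(rows) <= BigDecimal(maxRatio)
      case _ => false
    }
}
```

With no column stats collected, `shouldPushDown` always returns `false`, so a check like this would only help deployments that actually gather statistics.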






[GitHub] [spark] wangyum closed pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #31113:
URL: https://github.com/apache/spark/pull/31113


   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824933452


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42342/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r589321776



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       This looks like we are pushing the aggregate operator down through the join operator. It's usually beneficial, but it can cause a perf regression if the join children do not have many duplicates and the aggregate operator can't reduce the data volume.






[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757630823


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133901/
   




[GitHub] [spark] wangyum commented on a change in pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r589429806



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       Could we add a config?






[GitHub] [spark] wangyum edited a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757554974


   > Does it work for `ReplaceExceptWithAntiJoin` also?
   
   Yes, it should work for `ReplaceExceptWithAntiJoin`, but it doesn't improve the TPC-DS queries, so I didn't add it to this PR. We can add it in a follow-up PR.




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757535350


   **[Test build #133895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133895/testReport)** for PR 31113 at commit [`bfc35f4`](https://github.com/apache/spark/commit/bfc35f4006020b86c98f6148542748fb1cc984cd).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-824876818


   **[Test build #137812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137812/testReport)** for PR 31113 at commit [`e226e2c`](https://github.com/apache/spark/commit/e226e2ceea7b648a1864c7ff1ba1e67eee83789b).




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757486006


   **[Test build #133894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133894/testReport)** for PR 31113 at commit [`3a65541`](https://github.com/apache/spark/commit/3a65541da7876016f1c9d10069f176f27f387d01).




[GitHub] [spark] maropu commented on a change in pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r595762165



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       Yeah, that's what I meant: `spark.sql.cbo.enabled` or `spark.sql.cbo.planStats.enabled`.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757538061


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133895/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-825151837


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137814/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-798451114


   **[Test build #136030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136030/testReport)** for PR 31113 at commit [`6f3b87a`](https://github.com/apache/spark/commit/6f3b87a3bba6f7a827674b87b59e819af7419112).




[GitHub] [spark] cloud-fan commented on a change in pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r589321776



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       This looks like we are pushing the aggregate operator down through the join operator. It's usually beneficial, but it can cause a perf regression if the join children do not have many duplicates and the aggregate operator can't reduce the data volume. It also adds one more shuffle.






[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757483305


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38481/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757630823


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133901/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800152464


   **[Test build #136112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136112/testReport)** for PR 31113 at commit [`f55d0eb`](https://github.com/apache/spark/commit/f55d0ebb762d95a731f0b8e12d75f6f96b2ec48c).




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-825041075


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137812/
   




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-825135564


   **[Test build #137814 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137814/testReport)** for PR 31113 at commit [`941a59d`](https://github.com/apache/spark/commit/941a59daa138a94ce8ace2f639dd0698d08294fa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `trait ReplaceSetOperationRule extends Rule[LogicalPlan] `






[GitHub] [spark] cloud-fan commented on a change in pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r589321776



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       This looks like we are pushing the aggregate operator down through the join operator. It's usually beneficial, but it can cause a perf regression if the join children do not have many duplicates and the aggregate operator can't reduce the data volume. It also adds one more shuffle (there are now 2 DISTINCTs, i.e. 2 aggregate operators in the plan).
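
The trade-off can be illustrated with a small model on plain Scala collections. This is a sketch, not the optimizer rule itself. It assumes the pushed-down plan deduplicates both children before the left semi join (the exact plan shape produced by the PR is not shown in this thread), and `leftSemiJoin`, `distinctAfterJoin`, `distinctBeforeJoin`, and `joinInputRows` are hypothetical helper names. `Option` equality stands in for the null-safe `<=>` comparison.

```scala
object IntersectRewriteSketch {
  // A two-column row; None models SQL NULL.
  type Row = (Option[Int], Option[Int])

  // Left semi join on equality of all columns: keep each left row that has
  // at least one equal row on the right. None == None is true, like <=>.
  def leftSemiJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val rightSet = right.toSet
    left.filter(row => rightSet.contains(row))
  }

  // Existing rewrite: a single DISTINCT above the join.
  def distinctAfterJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    leftSemiJoin(left, right).distinct

  // Assumed pushed-down form: deduplicate each child, then join.
  def distinctBeforeJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    leftSemiJoin(left.distinct, right.distinct)

  // Number of rows that reach the join in each form.
  def joinInputRows(left: Seq[Row], right: Seq[Row], pushDown: Boolean): Int =
    if (pushDown) left.distinct.size + right.distinct.size
    else left.size + right.size
}
```

Both forms return the same rows. When the children contain many duplicates, `joinInputRows` drops noticeably with `pushDown = true`; when they contain few, it stays the same and the extra aggregates only add cost.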






[GitHub] [spark] wangyum commented on a change in pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31113:
URL: https://github.com/apache/spark/pull/31113#discussion_r595793043



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1663,8 +1663,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
 /**
  * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator.
  * {{{
- *   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
- *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2

Review comment:
       @cloud-fan What do you think? As far as I know, we have no column stats in production. I don't know what the situation is at other companies.






[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] Support push down DISTINCT in Set operations

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-800336988


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136118/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-798451114


   **[Test build #136030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136030/testReport)** for PR 31113 at commit [`6f3b87a`](https://github.com/apache/spark/commit/6f3b87a3bba6f7a827674b87b59e819af7419112).




[GitHub] [spark] SparkQA commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757491286


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38482/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757529521


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133894/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757476738


   **[Test build #133892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133892/testReport)** for PR 31113 at commit [`1dd0c1c`](https://github.com/apache/spark/commit/1dd0c1ceb9b5377eaed7372a94406528fcd82472).




[GitHub] [spark] SparkQA removed a comment on pull request #31113: [SPARK-34061][SQL] DISTINCT the INTERSECT children

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31113:
URL: https://github.com/apache/spark/pull/31113#issuecomment-757494922


   **[Test build #133895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133895/testReport)** for PR 31113 at commit [`bfc35f4`](https://github.com/apache/spark/commit/bfc35f4006020b86c98f6148542748fb1cc984cd).

