Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/17 19:05:10 UTC

[GitHub] [spark] c21 opened a new pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

c21 opened a new pull request #34034:
URL: https://github.com/apache/spark/pull/34034


   
   
   ### What changes were proposed in this pull request?
   For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we only need to keep one row per unique join key in the hash table (`HashedRelation`) when building it, because these joins only check whether a matching key exists on the build side. This reduces the size of the join's hash table.
   
   This PR adds the optimization to `UnsafeHashedRelation` for broadcast hash join and shuffled hash join. The optimization for `LongHashedRelation` will be added later, because it requires more changes to the underlying hash table data structure, `LongToUnsafeRowMap`, to check whether a key already exists in the hash table.
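
The idea above can be sketched without any Spark dependency. This is only a toy model under stated assumptions: `Row`, `buildRelation`, `leftSemi`, `leftAnti`, and `storedRows` are hypothetical names, and a plain Scala map stands in for `HashedRelation`. It is not the code in this PR.

```scala
import scala.collection.mutable

object SemiJoinBuildSketch {
  // Hypothetical build-side row: a join key plus one payload column.
  final case class Row(key: Int, payload: String)

  // Build a key -> rows table. When ignoreDuplicatedKey is true,
  // only the first row seen for each key is stored.
  def buildRelation(rows: Seq[Row], ignoreDuplicatedKey: Boolean): Map[Int, Vector[Row]] = {
    val table = mutable.LinkedHashMap.empty[Int, Vector[Row]]
    rows.foreach { row =>
      table.get(row.key) match {
        case Some(_) if ignoreDuplicatedKey => () // key already present: skip the duplicate
        case Some(existing) => table.update(row.key, existing :+ row)
        case None => table.update(row.key, Vector(row))
      }
    }
    table.toMap
  }

  // LEFT SEMI keeps streamed keys that exist in the relation.
  def leftSemi(streamed: Seq[Int], relation: Map[Int, Vector[Row]]): Seq[Int] =
    streamed.filter(k => relation.contains(k))

  // LEFT ANTI keeps streamed keys that do not exist in the relation.
  def leftAnti(streamed: Seq[Int], relation: Map[Int, Vector[Row]]): Seq[Int] =
    streamed.filterNot(k => relation.contains(k))

  // Total number of build rows held by the relation.
  def storedRows(relation: Map[Int, Vector[Row]]): Int =
    relation.valuesIterator.map(_.size).sum
}
```

Both join functions only ask whether a key is present, so dropping duplicate keys shrinks the table without changing the join output.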
   
   ### Why are the changes needed?
   This reduces the join hash table size for LEFT SEMI and LEFT ANTI joins.
   It can make a broadcast join more likely for these queries and lowers the risk of OOM in shuffled hash join.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added a unit test in `JoinSuite.scala`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923409220


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47972/
   




[GitHub] [spark] c21 commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923146032


   > The idea LGTM. Does it apply to other join types?
   
   @cloud-fan - I think we can apply a similar optimization to INNER and LEFT/RIGHT OUTER joins, as long as the join does not have an extra join condition. But for INNER and OUTER joins, we need to output every matching row from the build side, so we need some extra data structure to indicate the number of rows per unique join key (similar to the `BitSet` we introduced in `ShuffledHashJoinExec.fullOuterJoinWithUniqueKey`). Shall I make the change later for INNER and LEFT/RIGHT OUTER joins?




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923800737


   **[Test build #143472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143472/testReport)** for PR 34034 at commit [`25eed04`](https://github.com/apache/spark/commit/25eed045a4d745e59338135faf79770f02d98925).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928766895


   **[Test build #143668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143668/testReport)** for PR 34034 at commit [`16fa504`](https://github.com/apache/spark/commit/16fa504a1bbee9935952d6d771b311d74231b6da).




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928819562


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48180/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923578589


   **[Test build #143466 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143466/testReport)** for PR 34034 at commit [`b26fab1`](https://github.com/apache/spark/commit/b26fab108a04ff445f452f610037602a7993184b).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923490975


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47977/
   




[GitHub] [spark] c21 commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r712695659



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##########
@@ -1402,4 +1402,28 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       assertJoin(sql, classOf[ShuffledHashJoinExec])
     }
   }
+
+  test("SPARK-36794: Ignore duplicated key when building relation for semi/anti hash join") {

Review comment:
       > but this test passes with and without the patch, right?
   
   Yes. This unit test is added mostly to verify that the join still works as expected after this PR. This PR is an improvement, not a regression fix.
   
   > Seems there isn't a way to show the difference?
   
   Well, in theory we can add more code to check the number of rows inside `HashedRelation`, which should differ before and after this PR. However, this would need more code changes, e.g. introducing a new SQL metric for the number of rows inside `HashedRelation`, which looks unnecessary.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923579349


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143466/
   




[GitHub] [spark] c21 commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r712695659



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##########
@@ -1402,4 +1402,28 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       assertJoin(sql, classOf[ShuffledHashJoinExec])
     }
   }
+
+  test("SPARK-36794: Ignore duplicated key when building relation for semi/anti hash join") {

Review comment:
       > but this test passes with and without the patch, right?
   
   Yes. This unit test is added mostly to verify that the join still works as expected after this PR. This PR is an improvement, not a regression fix.
   
   > Seems there isn't a way to show the difference?
   
   Well, in theory we can add more code to check the number of rows inside `HashedRelation`, which should differ before and after this PR. However, this would need more code changes, e.g. introducing a new SQL metric for the number of rows inside `HashedRelation`, which looks unnecessary.






[GitHub] [spark] cloud-fan closed pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #34034:
URL: https://github.com/apache/spark/pull/34034


   




[GitHub] [spark] SparkQA removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926473495


   **[Test build #143598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143598/testReport)** for PR 34034 at commit [`b22479a`](https://github.com/apache/spark/commit/b22479a85a4d7992b81305df27af50c9915dabf0).




[GitHub] [spark] SparkQA removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922020228


   **[Test build #143424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143424/testReport)** for PR 34034 at commit [`f73c393`](https://github.com/apache/spark/commit/f73c393e35490d5ba59dab2e54ce6acc2897f7e8).




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923579349


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143466/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923802227


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143472/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923489599


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47977/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926624665


   **[Test build #143598 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143598/testReport)** for PR 34034 at commit [`b22479a`](https://github.com/apache/spark/commit/b22479a85a4d7992b81305df27af50c9915dabf0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922042686


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47931/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923453550


   **[Test build #143461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143461/testReport)** for PR 34034 at commit [`b4bd1ef`](https://github.com/apache/spark/commit/b4bd1ef0fd700b52caaae642d91f1c6969d5da86).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923688115


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47983/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923442661


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47977/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922077574


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143424/
   




[GitHub] [spark] c21 edited a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 edited a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923146032


   > The idea LGTM. Does it apply to other join types?
   
   @cloud-fan - actually it does not apply to other join types, because for INNER and OUTER joins we need to output every matching row from the build side. Rows from the build side may have the same value for the join column but different values for other columns.
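
A small illustration of this point, written as a plain Scala toy model rather than Spark code. `BuildRow`, `innerJoin`, `leftSemi`, and `dedupByKey` are hypothetical names introduced only for this sketch.

```scala
import scala.collection.mutable

object InnerJoinDedupSketch {
  // Hypothetical build-side row: a join key plus one non-key column.
  final case class BuildRow(key: Int, value: String)

  // INNER join: one output row per (streamed key, matching build row) pair.
  def innerJoin(streamed: Seq[Int], build: Seq[BuildRow]): Seq[(Int, String)] =
    for {
      k <- streamed
      b <- build if b.key == k
    } yield (k, b.value)

  // LEFT SEMI: each streamed key appears at most once, if any build row matches.
  def leftSemi(streamed: Seq[Int], build: Seq[BuildRow]): Seq[Int] =
    streamed.filter(k => build.exists(_.key == k))

  // Keep only the first build row for each key, preserving input order.
  def dedupByKey(build: Seq[BuildRow]): Seq[BuildRow] = {
    val seen = mutable.HashSet.empty[Int]
    build.filter(row => seen.add(row.key)) // add returns false for an already-seen key
  }
}
```

With two build rows that share a key but differ in `value`, de-duplicating the build side leaves the semi join unchanged but drops an output row from the inner join.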




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922048059


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47931/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r714962626



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -158,6 +158,11 @@ trait HashJoin extends JoinCodegenSupport {
         output, (streamedPlan.output ++ buildPlan.output).map(_.withNullability(true)))
   }
 
+  @transient protected lazy val ignoreDuplicatedKey = joinType match {
+    case LeftExistence(_) if condition.isEmpty => true

Review comment:
       Shall we make it more accurate? I think we can apply this optimization if the join condition only refers to the join keys, e.g. `t1 LEFT SEMI JOIN t2 ON t1.key = t2.key AND t2.key > 3`
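
The reasoning behind this suggestion can be checked with a minimal sketch that uses plain Scala collections instead of Spark's hash relation. The `Row` type, the `semiJoin` and `dedupByKey` helpers, and the sample data are all hypothetical and exist only for this illustration; the snippet is meant to be pasted into the Scala REPL or run as a script.

```scala
// Hypothetical row type: `key` is the join key, `other` is a non-key column.
final case class Row(key: Int, other: Int)

// Left semi join: keep a streamed row if ANY build row with the same key
// also satisfies the extra join condition.
def semiJoin(streamed: Seq[Row], build: Seq[Row])(cond: (Row, Row) => Boolean): Seq[Row] = {
  val relation: Map[Int, Seq[Row]] = build.groupBy(_.key)
  streamed.filter { s =>
    relation.getOrElse(s.key, Seq.empty[Row]).exists(b => cond(s, b))
  }
}

// Keep only the first build row per key, i.e. "ignore duplicated keys".
def dedupByKey(build: Seq[Row]): Seq[Row] =
  build.groupBy(_.key).values.map(_.head).toSeq

val streamed = Seq(Row(1, 10), Row(4, 40), Row(5, 50))
val build = Seq(Row(4, 0), Row(4, 1), Row(5, 0), Row(5, 1))

// Condition on the build join key only: every duplicate gives the same answer,
// so dropping duplicates cannot change the result.
val keyOnly: (Row, Row) => Boolean = (_, b) => b.key > 3
val fullKeyOnly = semiJoin(streamed, build)(keyOnly)
val dedupKeyOnly = semiJoin(streamed, dedupByKey(build))(keyOnly)

// Condition on a non-key build column: the surviving duplicate may not be
// the one that matches, so rows can be lost.
val nonKey: (Row, Row) => Boolean = (_, b) => b.other == 1
val fullNonKey = semiJoin(streamed, build)(nonKey)
val dedupNonKey = semiJoin(streamed, dedupByKey(build))(nonKey)
```

With the key-only condition the deduplicated build side yields the same rows as the full one, while the non-key condition loses matches after deduplication.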






[GitHub] [spark] c21 commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r715448828



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -158,6 +158,11 @@ trait HashJoin extends JoinCodegenSupport {
         output, (streamedPlan.output ++ buildPlan.output).map(_.withNullability(true)))
   }
 
+  @transient protected lazy val ignoreDuplicatedKey = joinType match {
+    case LeftExistence(_) if condition.isEmpty => true

Review comment:
       @cloud-fan - good point. Updated, and also tested it in `JoinSuite.scala`.






[GitHub] [spark] cloud-fan commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r716431440



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##########
@@ -1402,4 +1402,73 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       assertJoin(sql, classOf[ShuffledHashJoinExec])
     }
   }
+
+  test("SPARK-36794: Ignore duplicated key when building relation for semi/anti hash join") {
+    withTable("t1", "t2") {
+      spark.range(10).map(i => (i.toString, i + 1)).toDF("c1", "c2").write.saveAsTable("t1")
+      spark.range(10).map(i => ((i % 5).toString, i % 3)).toDF("c1", "c2").write.saveAsTable("t2")
+
+      Seq("BROADCAST", "SHUFFLE_HASH").foreach {
+        joinHint =>
+          val semiJoinDFs = Seq(
+            // No join condition, ignore duplicated key.
+            (sql(
+              s"SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2 ON t1.c1 = t2.c1"),
+              true),
+            // Have join condition on build join key only, ignore duplicated key.
+            (sql(
+              s"""
+                  |SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2
+                  |ON t1.c1 = t2.c1 AND CAST(t1.c2 * 2 AS STRING) != t2.c1
+               """.stripMargin),
+              true),
+            // Have join condition on other build attribute beside join key, do not ignore
+            // duplicated key.
+            (sql(
+              s"""
+                 |SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2
+                 |ON t1.c1 = t2.c1 AND t1.c2 * 100 != t2.c2
+               """.stripMargin),
+              false)
+          )
+          val antiJoinDFs = Seq(

Review comment:
       Can we generate the anti join SQL queries with `.replace("SEMI", "ANTI")`?
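
A rough sketch of that suggestion, using plain strings instead of the `sql(...)` DataFrames from the test. The names `semiJoinQueries` and `antiJoinQueries` are hypothetical, and the hint value is fixed to one of the two hints the test iterates over.

```scala
// Hypothetical stand-ins for the SQL strings built in the test: each query is
// paired with the expected value of the ignore-duplicated-key flag.
val joinHint = "BROADCAST"
val semiJoinQueries: Seq[(String, Boolean)] = Seq(
  (s"SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2 ON t1.c1 = t2.c1", true),
  (s"SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2 " +
    "ON t1.c1 = t2.c1 AND t1.c2 * 100 != t2.c2", false)
)

// Derive the anti join variants; the expected flag carries over unchanged.
val antiJoinQueries: Seq[(String, Boolean)] = semiJoinQueries.map {
  case (query, ignoreDuplicatedKey) => (query.replace("SEMI", "ANTI"), ignoreDuplicatedKey)
}
```

This works only because "SEMI" appears nowhere else in the query text; a query that mentioned it in a column or table name would need a more targeted replacement.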






[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928869000


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48180/
   




[GitHub] [spark] cloud-fan closed pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #34034:
URL: https://github.com/apache/spark/pull/34034


   




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928887987








[GitHub] [spark] cloud-fan commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-939862907


   Unfortunately, this breaks broadcast reuse, which causes a performance regression. To reproduce:
   ```
   scala> val df1 = spark.range(1000)
   df1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val df2 = spark.range(100)
   df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val j1 = df1.join(df2, Seq("id"), "inner")
   j1: org.apache.spark.sql.DataFrame = [id: bigint]
   
   scala> val j2 = df1.join(df2, Seq("id"), "left_semi")
   j2: org.apache.spark.sql.DataFrame = [id: bigint]
   
   scala> val res = j1.union(j2)
   res: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]
   
   scala> res.collect()
   res0: Array[org.apache.spark.sql.Row] = Array([0], ...
   
   scala> res.explain
   ```
   
   Before this PR, the query plan was
   ```
   AdaptiveSparkPlan isFinalPlan=true
   +- == Final Plan ==
      Union
      :- *(3) Project [id#0L]
      :  +- *(3) BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, false
      :     :- *(3) Range (0, 1000, step=1, splits=1)
      :     +- BroadcastQueryStage 0
      :        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false,false), [id=#79]
      :           +- *(1) Range (0, 100, step=1, splits=1)
      +- *(4) BroadcastHashJoin [id#12L], [id#13L], LeftSemi, BuildRight, false
         :- *(4) Range (0, 1000, step=1, splits=1)
         +- BroadcastQueryStage 2
            +- ReusedExchange [id#13L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false,false), [id=#79]
   ```
   
   Now it's
   ```
   == Physical Plan ==
   +- == Final Plan ==
      Union
      :- *(3) Project [id#0L]
      :  +- *(3) BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, false
      :     :- *(3) Range (0, 1000, step=1, splits=1)
      :     +- BroadcastQueryStage 0
      :        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false,false), [id=#41]
      :           +- *(1) Range (0, 100, step=1, splits=1)
      +- *(4) BroadcastHashJoin [id#6L], [id#7L], LeftSemi, BuildRight, false
         :- *(4) Range (0, 1000, step=1, splits=1)
         +- BroadcastQueryStage 1
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false,true), [id=#50]
               +- *(2) Range (0, 100, step=1, splits=1)
   ```
   
   Ignoring duplicated keys is a small improvement, and broadcast reuse is definitely more important to query performance. I'm reverting this first. Please re-propose this optimization without breaking broadcast reuse.
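
   The two plans differ in the last flag of `HashedRelationBroadcastMode` (`false,false` for the inner join versus `false,true` for the semi join). As a minimal sketch of why that would prevent reuse, assume that an exchange can be reused only when it compares equal to an existing one. `BroadcastModeSketch` and `ExchangeSketch` are hypothetical simplifications, not Spark's real classes.

```scala
// Simplified, hypothetical model of a broadcast mode; not Spark's real class.
final case class BroadcastModeSketch(
    keys: List[String],
    isNullAware: Boolean,
    ignoreDuplicatedKey: Boolean)

// Hypothetical exchange: two exchanges are interchangeable only if both the
// mode and the child plan description are equal.
final case class ExchangeSketch(mode: BroadcastModeSketch, child: String)

val child = "Range (0, 100, step=1, splits=1)"
val innerSide = ExchangeSketch(BroadcastModeSketch(List("id"), false, false), child)
val semiBefore = ExchangeSketch(BroadcastModeSketch(List("id"), false, false), child)
val semiAfter = ExchangeSketch(BroadcastModeSketch(List("id"), false, true), child)

// Reuse lookup modelled as plain case-class equality.
val reusableBefore = innerSide == semiBefore
val reusableAfter = innerSide == semiAfter
```

   Under that assumption the semi join side matches the inner join's exchange before the change and no longer matches once the extra flag is set, which is consistent with the second plan building its own `BroadcastExchange`.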




[GitHub] [spark] c21 commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r717205037



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##########
@@ -1402,4 +1402,73 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       assertJoin(sql, classOf[ShuffledHashJoinExec])
     }
   }
+
+  test("SPARK-36794: Ignore duplicated key when building relation for semi/anti hash join") {
+    withTable("t1", "t2") {
+      spark.range(10).map(i => (i.toString, i + 1)).toDF("c1", "c2").write.saveAsTable("t1")
+      spark.range(10).map(i => ((i % 5).toString, i % 3)).toDF("c1", "c2").write.saveAsTable("t2")
+
+      Seq("BROADCAST", "SHUFFLE_HASH").foreach {
+        joinHint =>
+          val semiJoinDFs = Seq(
+            // No join condition, ignore duplicated key.
+            (sql(
+              s"SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2 ON t1.c1 = t2.c1"),
+              true),
+            // Have join condition on build join key only, ignore duplicated key.
+            (sql(
+              s"""
+                  |SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2
+                  |ON t1.c1 = t2.c1 AND CAST(t1.c2 * 2 AS STRING) != t2.c1
+               """.stripMargin),
+              true),
+            // Have join condition on other build attribute beside join key, do not ignore
+            // duplicated key.
+            (sql(
+              s"""
+                 |SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2
+                 |ON t1.c1 = t2.c1 AND t1.c2 * 100 != t2.c2
+               """.stripMargin),
+              false)
+          )
+          val antiJoinDFs = Seq(

Review comment:
       @cloud-fan - sure, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -158,6 +158,17 @@ trait HashJoin extends JoinCodegenSupport {
         output, (streamedPlan.output ++ buildPlan.output).map(_.withNullability(true)))
   }
 
+  // Exposed for testing
+  @transient lazy val ignoreDuplicatedKey = joinType match {
+    case LeftExistence(_) =>
+      // For building hash relation, ignore duplicated rows with same join keys if:
+      // 1. Join condition is empty.
+      // 2. Join condition only references streamed attributes and build join keys.

Review comment:
       @viirya - yes, updated.






[GitHub] [spark] cloud-fan commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928932430


   thanks, merging to master!




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922020228


   **[Test build #143424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143424/testReport)** for PR 34034 at commit [`f73c393`](https://github.com/apache/spark/commit/f73c393e35490d5ba59dab2e54ce6acc2897f7e8).




[GitHub] [spark] cloud-fan commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922762500


   The idea LGTM. Does it apply to other join types?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923490975


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47977/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923414662


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47972/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923695586


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47983/
   




[GitHub] [spark] c21 commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922017923


   cc @cloud-fan, @maropu and @viirya - could you help take a look when you have time? Thanks.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928887987


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48180/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923648413


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47983/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922077574


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143424/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926473495


   **[Test build #143598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143598/testReport)** for PR 34034 at commit [`b22479a`](https://github.com/apache/spark/commit/b22479a85a4d7992b81305df27af50c9915dabf0).




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-929008972


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143668/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923455671


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143461/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922077404


   **[Test build #143424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143424/testReport)** for PR 34034 at commit [`f73c393`](https://github.com/apache/spark/commit/f73c393e35490d5ba59dab2e54ce6acc2897f7e8).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `case class HashedRelationBroadcastMode(`




[GitHub] [spark] viirya commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r716897033



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -158,6 +158,17 @@ trait HashJoin extends JoinCodegenSupport {
         output, (streamedPlan.output ++ buildPlan.output).map(_.withNullability(true)))
   }
 
+  // Exposed for testing
+  @transient lazy val ignoreDuplicatedKey = joinType match {
+    case LeftExistence(_) =>
+      // For building hash relation, ignore duplicated rows with same join keys if:
+      // 1. Join condition is empty.
+      // 2. Join condition only references streamed attributes and build join keys.

Review comment:
       We only need to meet either one of the requirements? So maybe say
   
   ```
   1. Join condition is empty, or
   2. ...
   ```
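
   The "either condition" rule in the quoted code comment can be sketched as a single predicate. This is a hypothetical simplification in which attributes are plain strings; the function name and parameters are illustrative and are not taken from the PR. It covers only the condition check, not the `LeftExistence` join type match.

```scala
// Hypothetical simplification: attributes are represented as plain strings.
// `conditionRefs` is None when there is no extra join condition, otherwise it
// holds the attributes the condition references.
def canIgnoreDuplicatedKey(
    conditionRefs: Option[Set[String]],
    streamedAttrs: Set[String],
    buildKeys: Set[String]): Boolean =
  // Option.forall is true for None, which covers case 1 (empty condition);
  // for Some(refs) it checks case 2 (only streamed attributes and build keys).
  conditionRefs.forall(_.subsetOf(streamedAttrs ++ buildKeys))

val streamedAttrs = Set("t1.c1", "t1.c2")
val buildKeys = Set("t2.c1")

val noCondition = canIgnoreDuplicatedKey(None, streamedAttrs, buildKeys)
val keyOnlyCondition =
  canIgnoreDuplicatedKey(Some(Set("t1.c2", "t2.c1")), streamedAttrs, buildKeys)
val nonKeyCondition =
  canIgnoreDuplicatedKey(Some(Set("t1.c2", "t2.c2")), streamedAttrs, buildKeys)
```

   The three sample calls mirror the three cases in the `JoinSuite` test: no condition, a condition on the build join key only, and a condition on another build attribute.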




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928887987








[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923362144


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47972/
   




[GitHub] [spark] viirya commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-940408227


   +1 for reverting it first.




[GitHub] [spark] SparkQA removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923415570


   **[Test build #143466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143466/testReport)** for PR 34034 at commit [`b26fab1`](https://github.com/apache/spark/commit/b26fab108a04ff445f452f610037602a7993184b).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-929008972


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143668/
   




[GitHub] [spark] SparkQA removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923318116


   **[Test build #143461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143461/testReport)** for PR 34034 at commit [`b4bd1ef`](https://github.com/apache/spark/commit/b4bd1ef0fd700b52caaae642d91f1c6969d5da86).




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923415570


   **[Test build #143466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143466/testReport)** for PR 34034 at commit [`b26fab1`](https://github.com/apache/spark/commit/b26fab108a04ff445f452f610037602a7993184b).




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923318116


   **[Test build #143461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143461/testReport)** for PR 34034 at commit [`b4bd1ef`](https://github.com/apache/spark/commit/b4bd1ef0fd700b52caaae642d91f1c6969d5da86).




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926507247


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48110/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923455671


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143461/
   




[GitHub] [spark] c21 commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-924307058


   Just FYI this PR is ready for review. Thanks. @cloud-fan, @viirya and @huaxingao.




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923695586


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47983/
   




[GitHub] [spark] c21 edited a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 edited a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923146032


   > The idea LGTM. Does it apply to other join types?
   
   @cloud-fan - I think we can apply a similar approach to INNER and LEFT/RIGHT OUTER joins, as long as the join has no extra join condition. But for INNER and OUTER joins, we need to output every matching row from the build side, so we need some extra data structure to record the number of rows per unique join key (similar to the `BitSet` we introduced in `ShuffledHashJoinExec.fullOuterJoinWithUniqueKey`). Shall I make that change for INNER and LEFT/RIGHT OUTER joins later, in another PR?
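
   To illustrate the "rows per unique join key" idea, here is a minimal sketch in plain Scala collections. It is not Spark code: `buildCounts` and `innerJoin` are hypothetical names, and it assumes the join output needs nothing from the build side beyond the join key, so one counter per key is enough to reproduce the row multiplicity of an INNER join.

   ```scala
   object InnerJoinCountSketch {
     // Sample data shaped like the tables in the PR's JoinSuite test (assumption).
     val streamed: Seq[(String, Long)] = (0 until 10).map(i => (i.toString, i + 1L))
     val buildKeys: Seq[String] = (0 until 10).map(i => (i % 5).toString)

     // Store one entry per unique build key, with the number of build rows sharing it.
     def buildCounts(keys: Seq[String]): Map[String, Int] =
       keys.groupBy(identity).map { case (k, ks) => k -> ks.size }

     // INNER join on the key only: emit each streamed row once per matching build row.
     def innerJoin(rows: Seq[(String, Long)], counts: Map[String, Int]): Seq[(String, Long)] =
       rows.flatMap { case row @ (k, _) => Seq.fill(counts.getOrElse(k, 0))(row) }
   }
   ```

   With an extra join condition on other build-side columns, a per-key count alone would not be enough, which matches the restriction stated above.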




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923625600


   **[Test build #143472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143472/testReport)** for PR 34034 at commit [`25eed04`](https://github.com/apache/spark/commit/25eed045a4d745e59338135faf79770f02d98925).




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926540472


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48110/
   




[GitHub] [spark] c21 commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-929512354


   Thank you @cloud-fan, @viirya and @huaxingao for review!




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926626101


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143598/
   




[GitHub] [spark] c21 commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r717205037



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##########
@@ -1402,4 +1402,73 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       assertJoin(sql, classOf[ShuffledHashJoinExec])
     }
   }
+
+  test("SPARK-36794: Ignore duplicated key when building relation for semi/anti hash join") {
+    withTable("t1", "t2") {
+      spark.range(10).map(i => (i.toString, i + 1)).toDF("c1", "c2").write.saveAsTable("t1")
+      spark.range(10).map(i => ((i % 5).toString, i % 3)).toDF("c1", "c2").write.saveAsTable("t2")
+
+      Seq("BROADCAST", "SHUFFLE_HASH").foreach {
+        joinHint =>
+          val semiJoinDFs = Seq(
+            // No join condition, ignore duplicated key.
+            (sql(
+              s"SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2 ON t1.c1 = t2.c1"),
+              true),
+            // Have join condition on build join key only, ignore duplicated key.
+            (sql(
+              s"""
+                  |SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2
+                  |ON t1.c1 = t2.c1 AND CAST(t1.c2 * 2 AS STRING) != t2.c1
+               """.stripMargin),
+              true),
+            // Have join condition on other build attribute beside join key, do not ignore
+            // duplicated key.
+            (sql(
+              s"""
+                 |SELECT /*+ $joinHint(t2) */ t1.c1 FROM t1 LEFT SEMI JOIN t2
+                 |ON t1.c1 = t2.c1 AND t1.c2 * 100 != t2.c2
+               """.stripMargin),
+              false)
+          )
+          val antiJoinDFs = Seq(

Review comment:
       @cloud-fan - sure, updated.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -158,6 +158,17 @@ trait HashJoin extends JoinCodegenSupport {
         output, (streamedPlan.output ++ buildPlan.output).map(_.withNullability(true)))
   }
 
+  // Exposed for testing
+  @transient lazy val ignoreDuplicatedKey = joinType match {
+    case LeftExistence(_) =>
+      // For building hash relation, ignore duplicated rows with same join keys if:
+      // 1. Join condition is empty.
+      // 2. Join condition only references streamed attributes and build join keys.

Review comment:
       @viirya - yes, updated.
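
   The two conditions in the quoted code comment can be sketched in isolation. This is a simplified model, not the code in `HashJoin.scala`: attributes are plain strings instead of Catalyst expressions, and `JoinKind`, `SemiOrAnti` and the parameter names are hypothetical stand-ins.

   ```scala
   object IgnoreDuplicatedKeySketch {
     sealed trait JoinKind
     case object SemiOrAnti extends JoinKind // stands in for joins matched by LeftExistence(_)
     case object Other extends JoinKind

     // conditionRefs = None models an empty join condition (case 1).
     // Otherwise every attribute the condition references must come from the
     // streamed side or from the build-side join keys (case 2).
     def ignoreDuplicatedKey(
         joinKind: JoinKind,
         conditionRefs: Option[Set[String]],
         streamedOutput: Set[String],
         buildKeyRefs: Set[String]): Boolean = joinKind match {
       case SemiOrAnti => conditionRefs.forall(_.subsetOf(streamedOutput ++ buildKeyRefs))
       case Other => false
     }
   }
   ```

   Applied to the three semi join queries in the test above, with `t1` streamed and `t2.c1` as the build key, this model yields `true`, `true` and `false` respectively.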






[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928990721


   **[Test build #143668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143668/testReport)** for PR 34034 at commit [`16fa504`](https://github.com/apache/spark/commit/16fa504a1bbee9935952d6d771b311d74231b6da).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923625600


   **[Test build #143472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143472/testReport)** for PR 34034 at commit [`25eed04`](https://github.com/apache/spark/commit/25eed045a4d745e59338135faf79770f02d98925).




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923802227


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143472/
   




[GitHub] [spark] huaxingao commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r712694192



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##########
@@ -1402,4 +1402,28 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
       assertJoin(sql, classOf[ShuffledHashJoinExec])
     }
   }
+
+  test("SPARK-36794: Ignore duplicated key when building relation for semi/anti hash join") {

Review comment:
       Ignoring the duplicate keys looks like a good idea, but this test passes with and without the patch, right? It seems there isn't a way to show the difference?
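
   A minimal sketch of why result-based assertions cannot show the difference, using plain Scala collections rather than Spark's hash relation. The names `buildRelation`, `leftSemi` and `storedRows` are hypothetical, and the data mirrors `t1`/`t2` from the test: dropping duplicated build keys leaves the semi join output unchanged and only shrinks what the relation stores.

   ```scala
   object SemiJoinDedupSketch {
     // Sample data shaped like t1 (streamed side) and t2 (build side) in the test.
     val t1: Seq[(String, Long)] = (0 until 10).map(i => (i.toString, i + 1L))
     val t2: Seq[(String, Long)] = (0 until 10).map(i => ((i % 5).toString, i % 3L))

     // Stand-in for a hash relation: join key -> build-side values for that key.
     def buildRelation(
         build: Seq[(String, Long)],
         ignoreDuplicatedKey: Boolean): Map[String, Seq[Long]] = {
       val grouped = build.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
       // When duplicates are ignored, keep only the first row seen for each key.
       if (ignoreDuplicatedKey) grouped.map { case (k, rows) => k -> rows.take(1) } else grouped
     }

     // LEFT SEMI join with no extra condition: keep a streamed row if its key exists.
     def leftSemi(streamed: Seq[(String, Long)], relation: Map[String, Seq[Long]]): Seq[String] =
       streamed.collect { case (k, _) if relation.contains(k) => k }

     // Total number of build rows held by the relation.
     def storedRows(relation: Map[String, Seq[Long]]): Int = relation.values.map(_.size).sum
   }
   ```

   Under these assumptions the observable difference is the number of stored rows, not the query result.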






[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922048030


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47931/
   




[GitHub] [spark] c21 commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922145935


   The unit tests fail because this PR adds a new parameter to `HashedRelationBroadcastMode`, which changes a lot of query plans. I will fix them, but that should not block code review.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926626101


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143598/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928766895


   **[Test build #143668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143668/testReport)** for PR 34034 at commit [`16fa504`](https://github.com/apache/spark/commit/16fa504a1bbee9935952d6d771b311d74231b6da).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926540472


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48110/
   




[GitHub] [spark] SparkQA removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928766895


   **[Test build #143668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143668/testReport)** for PR 34034 at commit [`16fa504`](https://github.com/apache/spark/commit/16fa504a1bbee9935952d6d771b311d74231b6da).




[GitHub] [spark] c21 commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-927548495


   The PR is ready for review. I addressed the comment about covering broader join conditions. Thanks @cloud-fan.




[GitHub] [spark] AmplabJenkins commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928887987


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48180/
   




[GitHub] [spark] c21 commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923397211


   To make the review easier: many files changed only because of unit test plan changes. The files below contain the real code change:
   
   * BroadcastExchangeExec.scala
   * BroadcastHashJoinExec.scala
   * HashJoin.scala
   * HashedRelation.scala
   * ShuffledHashJoinExec.scala
   
   All other file changes were generated with the following commands:
   
   ```
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite"
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilityWithStatsSuite"
   ```




[GitHub] [spark] viirya commented on a change in pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #34034:
URL: https://github.com/apache/spark/pull/34034#discussion_r716897033



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -158,6 +158,17 @@ trait HashJoin extends JoinCodegenSupport {
         output, (streamedPlan.output ++ buildPlan.output).map(_.withNullability(true)))
   }
 
+  // Exposed for testing
+  @transient lazy val ignoreDuplicatedKey = joinType match {
+    case LeftExistence(_) =>
+      // For building hash relation, ignore duplicated rows with same join keys if:
+      // 1. Join condition is empty.
+      // 2. Join condition only references streamed attributes and build join keys.

Review comment:
       We only need to meet either one of these requirements, right? If so, maybe say:
   
   ```
   1. Join condition is empty, or
   2. ...
   ```
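
   To illustrate why either requirement alone is enough, here is a minimal Spark-free sketch. It assumes rows are modeled as `(joinKey, payload)` pairs; `SemiJoinDedupSketch` and all of its members are hypothetical names and are not part of this PR. A left semi/anti join only asks whether a matching build row exists, so keeping one row per key cannot change the answer unless the condition reads a build column that is not a join key.

   ```scala
   // Spark-free sketch; every name here is hypothetical.
   object SemiJoinDedupSketch {
     // A row is (joinKey, payload).
     type Row = (Long, String)
     type Relation = Map[Long, Seq[Row]]

     // Keeps every build row per key, as an inner join requires.
     def buildFull(build: Seq[Row]): Relation = build.groupBy(_._1)

     // Keeps only the first build row per key (duplicated keys ignored).
     def buildDeduped(build: Seq[Row]): Relation =
       build.groupBy(_._1).map { case (k, rows) => k -> rows.take(1) }

     // Left semi join: keep a streamed row if any same-key build row passes `cond`.
     def leftSemi(streamed: Seq[Row], rel: Relation)(cond: (Row, Row) => Boolean): Seq[Row] =
       streamed.filter(s => rel.getOrElse(s._1, Seq.empty[Row]).exists(b => cond(s, b)))

     // Left anti join: the complement of left semi.
     def leftAnti(streamed: Seq[Row], rel: Relation)(cond: (Row, Row) => Boolean): Seq[Row] =
       streamed.filterNot(s => rel.getOrElse(s._1, Seq.empty[Row]).exists(b => cond(s, b)))
   }
   ```

   With an always-true condition, or a condition that only looks at the streamed row and the build key, `buildFull` and `buildDeduped` give the same semi/anti result. A condition such as `(_, b) => b._2 == "b"` reads the build payload, so the deduplicated relation can drop the only matching row. That is the case the two requirements exclude.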






[GitHub] [spark] cloud-fan commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-928932430


   thanks, merging to master!




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-922048059


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47931/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-923414662


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47972/
   




[GitHub] [spark] SparkQA commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-926532654


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48110/
   




[GitHub] [spark] c21 commented on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
c21 commented on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-940228124


   Thanks @cloud-fan. Yeah it's indeed a bug breaking broadcast exchange reuse and should be reverted first. Let me think about how to do a clean fix, thanks.




[GitHub] [spark] cloud-fan edited a comment on pull request #34034: [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join

Posted by GitBox <gi...@apache.org>.
cloud-fan edited a comment on pull request #34034:
URL: https://github.com/apache/spark/pull/34034#issuecomment-939862907


   Unfortunately, this breaks broadcast reuse, which causes a perf regression. To reproduce:
   ```
   scala> val df1 = spark.range(1000)
   df1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val df2 = spark.range(100)
   df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> val j1 = df1.join(df2, Seq("id"), "inner")
   j1: org.apache.spark.sql.DataFrame = [id: bigint]
   
   scala> val j2 = df1.join(df2, Seq("id"), "left_semi")
   j2: org.apache.spark.sql.DataFrame = [id: bigint]
   
   scala> val res = j1.union(j2)
   res: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]
   
   scala> res.collect()
   res0: Array[org.apache.spark.sql.Row] = Array([0], ...
   
   scala> res.explain
   ```
   
   Before this PR, the query plan was
   ```
   AdaptiveSparkPlan isFinalPlan=true
   +- == Final Plan ==
      Union
      :- *(3) Project [id#0L]
      :  +- *(3) BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, false
      :     :- *(3) Range (0, 1000, step=1, splits=1)
      :     +- BroadcastQueryStage 0
      :        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false,false), [id=#79]
      :           +- *(1) Range (0, 100, step=1, splits=1)
      +- *(4) BroadcastHashJoin [id#12L], [id#13L], LeftSemi, BuildRight, false
         :- *(4) Range (0, 1000, step=1, splits=1)
         +- BroadcastQueryStage 2
            +- ReusedExchange [id#13L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false,false), [id=#79]
   ```
   
   Now it's
   ```
   AdaptiveSparkPlan isFinalPlan=true
   +- == Final Plan ==
      Union
      :- *(3) Project [id#0L]
      :  +- *(3) BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, false
      :     :- *(3) Range (0, 1000, step=1, splits=1)
      :     +- BroadcastQueryStage 0
      :        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false,false), [id=#41]
      :           +- *(1) Range (0, 100, step=1, splits=1)
      +- *(4) BroadcastHashJoin [id#6L], [id#7L], LeftSemi, BuildRight, false
         :- *(4) Range (0, 1000, step=1, splits=1)
         +- BroadcastQueryStage 1
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false,true), [id=#50]
               +- *(2) Range (0, 100, step=1, splits=1)
   ```
   
   Ignoring duplicated keys is a small improvement, and broadcast reuse is definitely more important to query performance. I'm reverting this first. Please re-propose this optimization without breaking broadcast reuse.
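
   A rough way to see why the two plans differ, as a Spark-free sketch. It assumes an exchange can only be reused when it compares equal to an existing one, broadcast mode included. `ModeSketch`, `ExchangeSketch`, and their field names are hypothetical stand-ins for the three fields printed in `HashedRelationBroadcastMode(...)` above. Reading the third field as the ignore-duplicated-key flag is an inference from this PR, not something the plan output states.

   ```scala
   // Spark-free sketch; the classes and field names below are hypothetical stand-ins.
   object BroadcastReuseSketch {
     // Mirrors the three printed fields of the broadcast mode in the plans above.
     // The names `isNullAware` and `ignoreDuplicatedKey` are assumed for illustration.
     final case class ModeSketch(
         keys: List[String],
         isNullAware: Boolean,
         ignoreDuplicatedKey: Boolean)

     // An exchange is identified by its mode plus the plan it broadcasts.
     final case class ExchangeSketch(mode: ModeSketch, child: String)

     // With reuse, structurally equal exchanges are built only once,
     // so the number of builds is the number of distinct exchanges.
     def exchangesBuilt(requested: Seq[ExchangeSketch]): Int = requested.distinct.size
   }
   ```

   In this model, the inner join and the left semi join request equal exchanges when both modes end in `false,false`, so one build serves both. Once the semi join's mode ends in `false,true`, the two exchanges are no longer equal and each one is built separately, which matches the second plan.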

