You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/11/16 12:36:13 UTC

[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/23057

    [SPARK-26078][SQL] Dedup self-join attributes on IN subqueries

    ## What changes were proposed in this pull request?
    
    When there is a self-join as result of a IN subquery, the join condition may be invalid, resulting in trivially true predicates and return wrong results.
    
    The PR deduplicates the subquery output in order to avoid the issue.
    
    ## How was this patch tested?
    
    added UT


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-26078

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23057.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23057
    
----
commit 2af656a6b8ddae00555b04ecdbc7873adc6fc0b6
Author: Marco Gaido <ma...@...>
Date:   2018-11-16T12:27:35Z

    [SPARK-26078][SQL] Dedup self-join attributes on subqueries

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23057#discussion_r235237635
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
    @@ -70,6 +67,26 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
         case _ => joinPlan
       }
     
    +  private def rewriteDedupPlan(plan: LogicalPlan, rewrites: AttributeMap[Alias]): LogicalPlan = {
    +    val aliasedExpressions = plan.output.map { ref =>
    +      rewrites.getOrElse(ref, ref)
    +    }
    +    Project(aliasedExpressions, plan)
    +  }
    +
    +  private def dedupSubqueryOnSelfJoin(values: Seq[Expression], sub: LogicalPlan): LogicalPlan = {
    --- End diff --
    
    Add a simple code comment for this method?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #98914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98914/testReport)** for PR 23057 at commit [`a71b1c6`](https://github.com/apache/spark/commit/a71b1c6abd566e52063b3fb0343db5178ac67c8f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98913/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    @cloud-fan @gatorsmile may you please take a look at this? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23057#discussion_r234409196
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
    @@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
               // (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1
               val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And)
               // Deduplicate conflicting attributes if any.
    -          dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond)))
    +          dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond)))
             case (p, predicate) =>
               val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
               Project(p.output, Filter(newCond.get, inputPlan))
    --- End diff --
    
    In `rewriteExistentialExpr`, there is a similar logic for `InSubquery`. Should we also do `dedupSubqueryOnSelfJoin` for it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    cc @gatorsmile too


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23057#discussion_r234410124
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
    @@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
               // (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1
               val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And)
               // Deduplicate conflicting attributes if any.
    -          dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond)))
    +          dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond)))
             case (p, predicate) =>
               val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
               Project(p.output, Filter(newCond.get, inputPlan))
    --- End diff --
    
    mmmh...`rewriteExistentialExpr` operates on the result of the `foldLeft`,so every `InSubquery` there was already transformed using `dedupSubqueryOnSelfJoin`, right? So I don't think it is needed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #98920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98920/testReport)** for PR 23057 at commit [`a71b1c6`](https://github.com/apache/spark/commit/a71b1c6abd566e52063b3fb0343db5178ac67c8f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98920/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5081/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23057#discussion_r234409212
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala ---
    @@ -1280,4 +1281,34 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
           assert(subqueries.length == 1)
         }
       }
    +
    +  test("SPARK-26078: deduplicate fake self joins for IN subqueries") {
    +    withTempView("a", "b") {
    +      val a = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row("a", 2), Row("b", 1))),
    +        StructType(Seq(StructField("id", StringType), StructField("num", IntegerType))))
    +      val b = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row("a", 2), Row("b", 1))),
    +        StructType(Seq(StructField("id", StringType), StructField("num", IntegerType))))
    --- End diff --
    
    Two schema is the same. We can define it just once?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #99008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99008/testReport)** for PR 23057 at commit [`3d010fd`](https://github.com/apache/spark/commit/3d010fd911a9b07d1a9d13c79d52be186d023556).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #98913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98913/testReport)** for PR 23057 at commit [`2af656a`](https://github.com/apache/spark/commit/2af656a6b8ddae00555b04ecdbc7873adc6fc0b6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #98920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98920/testReport)** for PR 23057 at commit [`a71b1c6`](https://github.com/apache/spark/commit/a71b1c6abd566e52063b3fb0343db5178ac67c8f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98969/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5078/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    The change looks fine to me. cc @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #99111 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99111/testReport)** for PR 23057 at commit [`65fca4f`](https://github.com/apache/spark/commit/65fca4f8b5ae347fc3f1baf3069b92aaab1b00db).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #99111 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99111/testReport)** for PR 23057 at commit [`65fca4f`](https://github.com/apache/spark/commit/65fca4f8b5ae347fc3f1baf3069b92aaab1b00db).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    any comments @cloud-fan ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23057#discussion_r234547323
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
    @@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
               // (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1
               val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And)
               // Deduplicate conflicting attributes if any.
    -          dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond)))
    +          dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond)))
             case (p, predicate) =>
               val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
               Project(p.output, Filter(newCond.get, inputPlan))
    --- End diff --
    
    thanks for your help here @viirya. I added the check also to `rewriteExistentialExpr`. I was missing the case when it is invoked not only on the result of `foldLeft`. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98914/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #98914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98914/testReport)** for PR 23057 at commit [`a71b1c6`](https://github.com/apache/spark/commit/a71b1c6abd566e52063b3fb0343db5178ac67c8f).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    any more comments @cloud-fan  @viirya ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    cc @cloud-fan @viirya 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5112/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #98969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98969/testReport)** for PR 23057 at commit [`86106fa`](https://github.com/apache/spark/commit/86106fadcaed6c1a4768138b3d72e8c892b7cd7f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #98913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98913/testReport)** for PR 23057 at commit [`2af656a`](https://github.com/apache/spark/commit/2af656a6b8ddae00555b04ecdbc7873adc6fc0b6).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5222/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5140/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23057#discussion_r234521306
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
    @@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
               // (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1
               val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And)
               // Deduplicate conflicting attributes if any.
    -          dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond)))
    +          dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond)))
             case (p, predicate) =>
               val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
               Project(p.output, Filter(newCond.get, inputPlan))
    --- End diff --
    
    this fails indeed. I'll investigate it, thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99008/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5079/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23057#discussion_r234412635
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
    @@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
               // (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1
               val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And)
               // Deduplicate conflicting attributes if any.
    -          dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond)))
    +          dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond)))
             case (p, predicate) =>
               val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
               Project(p.output, Filter(newCond.get, inputPlan))
    --- End diff --
    
    Can you try this test case?
    
    ```scala
    val df1 = spark.sql(
            """
              |SELECT id,num,source FROM (
              |  SELECT id, num, 'a' as source FROM a
              |  UNION ALL
              |  SELECT id, num, 'b' as source FROM b
              |) AS c WHERE c.id IN (SELECT id FROM b WHERE num = 2) OR
              |c.id IN (SELECT id FROM b WHERE num = 3)
            """.stripMargin)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23057: [SPARK-26078][SQL] Dedup self-join attributes on ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23057#discussion_r234409158
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ---
    @@ -70,6 +67,27 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
         case _ => joinPlan
       }
     
    +  private def rewriteDedupPlan(plan: LogicalPlan, rewrites: AttributeMap[Alias]): LogicalPlan = {
    +    val aliasedExpressions = plan.output.map { ref =>
    +      rewrites.getOrElse(ref, ref)
    +    }
    +    Project(aliasedExpressions, plan)
    +  }
    +
    +  private def dedupSubqueryOnSelfJoin(values: Seq[Expression], sub: LogicalPlan): LogicalPlan = {
    +    val leftRefs = AttributeSet.fromAttributeSets(values.map(_.references))
    +    val rightRefs = AttributeSet(sub.output)
    --- End diff --
    
    This is just `outputSet`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #98969 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98969/testReport)** for PR 23057 at commit [`86106fa`](https://github.com/apache/spark/commit/86106fadcaed6c1a4768138b3d72e8c892b7cd7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    @mccheah this is waiting for reviews by committers


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    thanks @viirya , I added a comment


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Is this ready to merge?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Thanks @mgaido91. I will review this tomorrow.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    **[Test build #99008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99008/testReport)** for PR 23057 at commit [`3d010fd`](https://github.com/apache/spark/commit/3d010fd911a9b07d1a9d13c79d52be186d023556).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99111/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23057
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org