You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by nongli <gi...@git.apache.org> on 2015/12/09 00:55:11 UTC

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

GitHub user nongli opened a pull request:

    https://github.com/apache/spark/pull/10209

    [SPARK-9372] [SQL] For joins, insert IS NOT NULL filters to children.

    Some join types and conditions imply that the join keys cannot be NULL and
    can be filtered out by the children. This patch does this for inner joins
    and introduces a mechanism to generate predicates. The complex part of doing
    this is to make sure the transformation is stable. The problem that we want
    to avoid is generating a filter in the join, having that pushed down and then
    having the join regenerate the filter.
    
    This patch solves this by having the join remember predicates that it has
    generated. This mechanism should be general enough that we can infer other
    predicates, for example "a join b where a.id = b.id AND a.id = 10" could
    also use this mechanism to generate the predicate "b.id = 10".

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nongli/spark spark-9372

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10209
    
----
commit 565d15ec8f0c319a97370b133317c11140da49b4
Author: Nong Li <no...@databricks.com>
Date:   2015-12-08T23:50:25Z

    [SPARK-9372] [SQL] For joins, insert IS NOT NULL filters to children.
    
    Some join types and conditions imply that the join keys cannot be NULL and
    can be filtered out by the children. This patch does this for inner joins
    and introduces a mechanism to generate predicates. The complex part of doing
    this is to make sure the transformation is stable. The problem that we want
    to avoid is generating a filter in the join, having that pushed down and then
    having the join regenerate the filter.
    
    This patch solves this by having the join remember predicates that it has
    generated. This mechanism should be general enough that we can infer other
    predicates, for example "a join b where a.id = b.id AND a.id = 10" could
    also use this mechanism to generate the predicate "b.id = 10".

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163065565
  
    **[Test build #47375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47375/consoleFull)** for PR 10209 at commit [`565d15e`](https://github.com/apache/spark/commit/565d15ec8f0c319a97370b133317c11140da49b4).
     * This patch **fails to build**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163732157
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47531/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47268048
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala ---
    @@ -43,7 +43,7 @@ abstract class PlanTest extends SparkFunSuite {
       protected def comparePlans(plan1: LogicalPlan, plan2: LogicalPlan) {
         val normalized1 = normalizeExprIds(plan1)
         val normalized2 = normalizeExprIds(plan2)
    -    if (normalized1 != normalized2) {
    +    if (!normalized1.semanticEquals(normalized2)) {
    --- End diff --
    
    Existing: do we need this hacky normalization logic above anymore?  I don't think `semanticEquals` existed when I wrote this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47227527
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -356,6 +360,51 @@ object LikeSimplification extends Rule[LogicalPlan] {
     }
     
     /**
    + * This rule adds IsNotNull predicates based on join keys. If the join contains a condition
    + * `a` binaryOp *, a is non-nullable. This adds filters for those attributes to the children.
    + *
    + * To avoid the problem of repeatedly generating the IsNotNull predicates, the join operator
    + * remembers all the expressions it generated.
    + */
    +object AddJoinKeyNullabilityFilters extends Rule[LogicalPlan] with PredicateHelper {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case j @ Join(left, right, Inner, Some(condition), generated) => {
    +      val nonNullableKeys = ExtractNonNullableAttributes.unapply(condition)
    --- End diff --
    
    nit: `val ExtractNonNullableAttributes(nonNullableKeys) = condition`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163065568
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47375/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47294889
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -122,11 +122,22 @@ case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(le
       override def output: Seq[Attribute] = left.output
     }
     
    +object Join {
    +  def apply(
    +    left: LogicalPlan,
    +    right: LogicalPlan,
    +    joinType: JoinType,
    +    condition: Option[Expression]): Join = {
    +    Join(left, right, joinType, condition, None)
    +  }
    +}
    +
     case class Join(
       left: LogicalPlan,
       right: LogicalPlan,
       joinType: JoinType,
    -  condition: Option[Expression]) extends BinaryNode {
    +  condition: Option[Expression],
    +  generatedExpressions: Option[EquivalentExpressions]) extends BinaryNode {
    --- End diff --
    
    Couldn't `Literal(5)` be in the equivalence class and we could check for that?
    
    That said, I also like the idea more general value constraints.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163710755
  
    **[Test build #47531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47531/consoleFull)** for PR 10209 at commit [`fb562fb`](https://github.com/apache/spark/commit/fb562fb67a761276456b14a81513f3fc69a6ead8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47267922
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -122,11 +122,22 @@ case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(le
       override def output: Seq[Attribute] = left.output
     }
     
    +object Join {
    +  def apply(
    +    left: LogicalPlan,
    +    right: LogicalPlan,
    +    joinType: JoinType,
    +    condition: Option[Expression]): Join = {
    +    Join(left, right, joinType, condition, None)
    +  }
    +}
    +
     case class Join(
       left: LogicalPlan,
       right: LogicalPlan,
       joinType: JoinType,
    -  condition: Option[Expression]) extends BinaryNode {
    +  condition: Option[Expression],
    +  generatedExpressions: Option[EquivalentExpressions]) extends BinaryNode {
    --- End diff --
    
    This is semi-public API cause I think some advanced projects do dig into catalyst and we've never changed the signature of something as basic as `Join` before.  Could we do this instead by fixing nullablity propagation and only inserting the filter if the attribute is `nullable`? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10209


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47314107
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -122,11 +122,22 @@ case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(le
       override def output: Seq[Attribute] = left.output
     }
     
    +object Join {
    +  def apply(
    +    left: LogicalPlan,
    +    right: LogicalPlan,
    +    joinType: JoinType,
    +    condition: Option[Expression]): Join = {
    +    Join(left, right, joinType, condition, None)
    +  }
    +}
    +
     case class Join(
       left: LogicalPlan,
       right: LogicalPlan,
       joinType: JoinType,
    -  condition: Option[Expression]) extends BinaryNode {
    +  condition: Option[Expression],
    +  generatedExpressions: Option[EquivalentExpressions]) extends BinaryNode {
    --- End diff --
    
    I was thinking operators would propagate the set of constraints up from their children (possibly augmenting or clearing as appropriate) and we'd save it in a `lazy val`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47309024
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -122,11 +122,22 @@ case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(le
       override def output: Seq[Attribute] = left.output
     }
     
    +object Join {
    +  def apply(
    +    left: LogicalPlan,
    +    right: LogicalPlan,
    +    joinType: JoinType,
    +    condition: Option[Expression]): Join = {
    +    Join(left, right, joinType, condition, None)
    +  }
    +}
    +
     case class Join(
       left: LogicalPlan,
       right: LogicalPlan,
       joinType: JoinType,
    -  condition: Option[Expression]) extends BinaryNode {
    +  condition: Option[Expression],
    +  generatedExpressions: Option[EquivalentExpressions]) extends BinaryNode {
    --- End diff --
    
    Literal(5) would work for equivalence but we want to track more than equality. If it was t1.key join t2.key where t1.key = t2.key and t1.key > 5, we'd similarly want to add t2.key > 5. 
    
    Are you suggesting we don't change the operator and walk the tree bottom up to collect these constraints? This seems extremely expensive to do.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163507143
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47487/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163514568
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47488/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163732156
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163062943
  
    **[Test build #47375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47375/consoleFull)** for PR 10209 at commit [`565d15e`](https://github.com/apache/spark/commit/565d15ec8f0c319a97370b133317c11140da49b4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-217503882
  
    Is this already fixed by https://github.com/apache/spark/pull/7768 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163514567
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163732061
  
    **[Test build #47531 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47531/consoleFull)** for PR 10209 at commit [`fb562fb`](https://github.com/apache/spark/commit/fb562fb67a761276456b14a81513f3fc69a6ead8).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47291050
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -122,11 +122,22 @@ case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(le
       override def output: Seq[Attribute] = left.output
     }
     
    +object Join {
    +  def apply(
    +    left: LogicalPlan,
    +    right: LogicalPlan,
    +    joinType: JoinType,
    +    condition: Option[Expression]): Join = {
    +    Join(left, right, joinType, condition, None)
    +  }
    +}
    +
     case class Join(
       left: LogicalPlan,
       right: LogicalPlan,
       joinType: JoinType,
    -  condition: Option[Expression]) extends BinaryNode {
    +  condition: Option[Expression],
    +  generatedExpressions: Option[EquivalentExpressions]) extends BinaryNode {
    --- End diff --
    
    I started down that path but can you think of a way to make that handle the more general case of predicate propagation?
    
    t1.key join t2.key where t1.key = t2.key and t1.key = 5.
    
    How do we generate the predicate t2.key = 5? how do we make this more general? 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163514549
  
    **[Test build #47488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47488/consoleFull)** for PR 10209 at commit [`00e957c`](https://github.com/apache/spark/commit/00e957c83ca2983941170e503e926535f2c8a91c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by nongli <gi...@git.apache.org>.

Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47293719
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -122,11 +122,22 @@ case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(le
       override def output: Seq[Attribute] = left.output
     }
     
    +object Join {
    +  def apply(
    +    left: LogicalPlan,
    +    right: LogicalPlan,
    +    joinType: JoinType,
    +    condition: Option[Expression]): Join = {
    +    Join(left, right, joinType, condition, None)
    +  }
    +}
    +
     case class Join(
       left: LogicalPlan,
       right: LogicalPlan,
       joinType: JoinType,
    -  condition: Option[Expression]) extends BinaryNode {
    +  condition: Option[Expression],
    +  generatedExpressions: Option[EquivalentExpressions]) extends BinaryNode {
    --- End diff --
    
    Equivalence classes is one thing, we can compute that no problem I think. The issue is how to remember that t2.key = 5 was generated and not to generate it again. The trick of setting nullable doesn't work here. We could maintain value constraints (where nullability is a subset).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47226377
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
    @@ -190,14 +190,15 @@ trait CheckAnalysis {
                      | ${exprs.map(_.prettyString).mkString(",")}""".stripMargin)
     
               // Special handling for cases when self-join introduce duplicate expression ids.
    -          case j @ Join(left, right, _, _) if left.outputSet.intersect(right.outputSet).nonEmpty =>
    -            val conflictingAttributes = left.outputSet.intersect(right.outputSet)
    -            failAnalysis(
    -              s"""
    -                 |Failure when resolving conflicting references in Join:
    -                 |$plan
    -                 |Conflicting attributes: ${conflictingAttributes.mkString(",")}
    -                 |""".stripMargin)
    +          case j @ Join(left, right, _, _, _)
    +            if left.outputSet.intersect(right.outputSet).nonEmpty =>
    --- End diff --
    
    nit: we can use `if !j.selfJoinResolved`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47228437
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -356,6 +360,51 @@ object LikeSimplification extends Rule[LogicalPlan] {
     }
     
     /**
    + * This rule adds IsNotNull predicates based on join keys. If the join contains a condition
    + * `a` binaryOp *, a is non-nullable. This adds filters for those attributes to the children.
    + *
    + * To avoid the problem of repeatedly generating the IsNotNull predicates, the join operator
    + * remembers all the expressions it generated.
    + */
    +object AddJoinKeyNullabilityFilters extends Rule[LogicalPlan] with PredicateHelper {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case j @ Join(left, right, Inner, Some(condition), generated) => {
    +      val nonNullableKeys = ExtractNonNullableAttributes.unapply(condition)
    --- End diff --
    
    oh nvm, I think your version looks more clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163627935
  
    This looks very cool!
    But I'm a little worried about making the `Join` stateful, how about adding a new batch in `Optimizer` and only execute it `Once`? So that we can avoid having the join regenerate the filter as your rule will be applied after all other optimization rules.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163065567
  
    Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47268378
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -99,6 +99,13 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
        */
       lazy val resolved: Boolean = expressions.forall(_.resolved) && childrenResolved
     
    +  /**
    +   * Returns true if the two plans are semantically equal. This should ignore state generated
    +   * during planning to help the planning process.
    +   * TODO: implement this as a pass that canonicalizes the plan tree instead?
    +   */
    +  def semanticEquals(other: LogicalPlan): Boolean = this == other
    --- End diff --
    
    Oh, this is a new semantic equals.  How is this different than `sameResult`?  Maybe we should unify the naming between Expression and LogicalPlan for this concept.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163507134
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10209#issuecomment-163507570
  
    **[Test build #47488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47488/consoleFull)** for PR 10209 at commit [`00e957c`](https://github.com/apache/spark/commit/00e957c83ca2983941170e503e926535f2c8a91c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9372] [SQL] For joins, insert IS NOT NU...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10209#discussion_r47291821
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -122,11 +122,22 @@ case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(le
       override def output: Seq[Attribute] = left.output
     }
     
    +object Join {
    +  def apply(
    +    left: LogicalPlan,
    +    right: LogicalPlan,
    +    joinType: JoinType,
    +    condition: Option[Expression]): Join = {
    +    Join(left, right, joinType, condition, None)
    +  }
    +}
    +
     case class Join(
       left: LogicalPlan,
       right: LogicalPlan,
       joinType: JoinType,
    -  condition: Option[Expression]) extends BinaryNode {
    +  condition: Option[Expression],
    +  generatedExpressions: Option[EquivalentExpressions]) extends BinaryNode {
    --- End diff --
    
    In an earlier version of catalyst we also had equivalence classes propagate up the logical plans.  Would that give you enough information?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org