You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2018/04/17 06:32:14 UTC

[GitHub] spark pull request #21083: [SPARK-21479][SPARK-23564][SQL] infer additional ...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/21083

    [SPARK-21479][SPARK-23564][SQL] infer additional filters from constraints for join's children

    ## What changes were proposed in this pull request?
    
    The existing query constraints framework has 2 steps:
    1. propagate constraints bottom up.
    2. use constraints to infer additional filters for better data pruning.
    
    For step 2, it mostly helps with Join, because we can connect the constraints from children to the join condition and infer powerful filters to prune the data of the join sides. e.g., the left side has constraints `a = 1`, the join condition is `left.a = right.a`, then we can infer `right.a = 1` to the right side and prune the right side a lot.
    
    However, the current logic of inferring filters from constraints for Join is pretty weak. It infers the filters from Join's constraints. Some joins like left semi/anti exclude output from right side and the right side constraints will be lost here.
    
    This PR propose to check the left and right constraints individually, expand the constraints with join condition and add filters to children of join directly, instead of adding to the join condition.
    
    This reverts https://github.com/apache/spark/pull/20670 , covers https://github.com/apache/spark/pull/20717 and https://github.com/apache/spark/pull/20816
    
    This is inspired by the original PRs and the tests are all from these PRs. Thanks to the authors @mgaido91 @maryannxue @KaiXinXiaoLei !
    
    ## How was this patch tested?
    
    new tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark join

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21083.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21083
    
----
commit 2977a5e037eb862a530a777e349f328ffbda39bb
Author: Wenchen Fan <we...@...>
Date:   2018-04-16T16:15:04Z

    Revert "[SPARK-23405] Generate additional constraints for Join's children"
    
    This reverts commit cdcccd7b41c43d79edff2fec7a84cd00e9524f75.

commit b967955ec2c7d33f28845dd55a1a9b70c5c2ba03
Author: Wenchen Fan <we...@...>
Date:   2018-04-16T19:39:50Z

    fix join filter inference from constraints

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89431/testReport)** for PR 21083 at commit [`b967955`](https://github.com/apache/spark/commit/b967955ec2c7d33f28845dd55a1a9b70c5c2ba03).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21083: [SPARK-23564][SQL] infer additional filters from ...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21083#discussion_r183295844
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -664,53 +662,52 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelpe
           }
     
         case join @ Join(left, right, joinType, conditionOpt) =>
    -      // Only consider constraints that can be pushed down completely to either the left or the
    -      // right child
    -      val constraints = join.allConstraints.filter { c =>
    -        c.references.subsetOf(left.outputSet) || c.references.subsetOf(right.outputSet)
    -      }
    -      // Remove those constraints that are already enforced by either the left or the right child
    -      val additionalConstraints = constraints -- (left.constraints ++ right.constraints)
    -      val newConditionOpt = conditionOpt match {
    -        case Some(condition) =>
    -          val newFilters = additionalConstraints -- splitConjunctivePredicates(condition)
    -          if (newFilters.nonEmpty) Option(And(newFilters.reduce(And), condition)) else conditionOpt
    -        case None =>
    -          additionalConstraints.reduceOption(And)
    -      }
    -      // Infer filter for left/right outer joins
    -      val newLeftOpt = joinType match {
    -        case RightOuter if newConditionOpt.isDefined =>
    -          val inferredConstraints = left.getRelevantConstraints(
    -            left.constraints
    -              .union(right.constraints)
    -              .union(splitConjunctivePredicates(newConditionOpt.get).toSet))
    -          val newFilters = inferredConstraints
    -            .filterNot(left.constraints.contains)
    -            .reduceLeftOption(And)
    -          newFilters.map(Filter(_, left))
    -        case _ => None
    -      }
    -      val newRightOpt = joinType match {
    -        case LeftOuter if newConditionOpt.isDefined =>
    -          val inferredConstraints = right.getRelevantConstraints(
    -            right.constraints
    -              .union(left.constraints)
    -              .union(splitConjunctivePredicates(newConditionOpt.get).toSet))
    -          val newFilters = inferredConstraints
    -            .filterNot(right.constraints.contains)
    -            .reduceLeftOption(And)
    -          newFilters.map(Filter(_, right))
    -        case _ => None
    -      }
    +      joinType match {
    +        // For inner join, we can infer additional filters for both sides. LeftSemi is kind of an
    +        // inner join, it just drops the right side in the final output.
    +        case _: InnerLike | LeftSemi =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newLeft = inferNewFilter(left, allConstraints)
    +          val newRight = inferNewFilter(right, allConstraints)
    +          join.copy(left = newLeft, right = newRight)
     
    -      if ((newConditionOpt.isDefined && (newConditionOpt ne conditionOpt))
    -        || newLeftOpt.isDefined || newRightOpt.isDefined) {
    -        Join(newLeftOpt.getOrElse(left), newRightOpt.getOrElse(right), joinType, newConditionOpt)
    -      } else {
    -        join
    +        // For right outer join, we can only infer additional filters for left side.
    +        case RightOuter =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newLeft = inferNewFilter(left, allConstraints)
    +          join.copy(left = newLeft)
    +
    +        // For left join, we can only infer additional filters for right side.
    +        case LeftOuter | LeftAnti =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newRight = inferNewFilter(right, allConstraints)
    +          join.copy(right = newRight)
    +
    +        case _ => join
           }
       }
    +
    +  private def getAllConstraints(
    +      left: LogicalPlan,
    +      right: LogicalPlan,
    +      conditionOpt: Option[Expression]): Set[Expression] = {
    +    val baseConstraints = left.constraints.union(right.constraints)
    +      .union(conditionOpt.map(splitConjunctivePredicates).getOrElse(Nil).toSet)
    +    baseConstraints.union(inferAdditionalConstraints(baseConstraints))
    +  }
    +
    +  private def inferNewFilter(plan: LogicalPlan, constraints: Set[Expression]): LogicalPlan = {
    +    val newPredicates = constraints
    +      .union(constructIsNotNullConstraints(constraints, plan.output))
    --- End diff --
    
    Can we put the code 
    ```
    constraints
              .union(inferAdditionalConstraints(constraints))
              .union(constructIsNotNullConstraints(constraints, output))
              .filter { c =>
                c.references.nonEmpty && c.references.subsetOf(outputSet) && c.deterministic
              }
    ```
    into the a function in `ConstraintHelper`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89480/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2376/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89705/testReport)** for PR 21083 at commit [`787cddf`](https://github.com/apache/spark/commit/787cddffeba0f21cd40312bcbf84d1bb75126044).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait QueryPlanConstraints extends ConstraintHelper `
      * `trait ConstraintHelper `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by maryannxue <gi...@git.apache.org>.
Github user maryannxue commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Thank you for you reply, @cloud-fan! I was not clear when you had become aware of the effort on SPARK-21479 so it might be a misunderstanding on my side and I apologize. Anyway, if you had had a closer look at the PR, you would have probably got the idea that it's basically the same approach as what you have here, only that you have covered more join types.
    Here's another note. There's two types of constraint-to-filter inference for joins going on here:
    1. Infer from the Join node constraints, which is covered by the `PushPredicateThroughJoin` rule;
    2. Infer from the sibling child node combined with the join condition, which is what you've added here.
    That said, the InnerLike joins should already be covered by 1 and might not be worth being considered again in this optimization rule. Not sure about LeftSemi joins, so it would be nice if we could have a test case that proves this optimization does something that has not yet been covered by `PushPredicateThroughJoin` rule.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89480/testReport)** for PR 21083 at commit [`787cddf`](https://github.com/apache/spark/commit/787cddffeba0f21cd40312bcbf84d1bb75126044).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89432/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by maryannxue <gi...@git.apache.org>.
Github user maryannxue commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    cc @mgaido91 @maryannxue @KaiXinXiaoLei @gatorsmile @jiangxb1987 @gengliangwang: I do not think this is the right way to do things, @cloud-fan. Looks like you have been aware of my and others' work like https://github.com/apache/spark/pull/20816, you could have, or I'd say, should have, given your input/suggestions on related PRs. People who have worked on this should deserve more credit than being mentioned as "inspired" here. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21083: [SPARK-21479][SPARK-23564][SQL] infer additional ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21083#discussion_r182297617
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -661,21 +661,51 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelpe
           }
     
         case join @ Join(left, right, joinType, conditionOpt) =>
    -      // Only consider constraints that can be pushed down completely to either the left or the
    -      // right child
    -      val constraints = join.allConstraints.filter { c =>
    -        c.references.subsetOf(left.outputSet) || c.references.subsetOf(right.outputSet)
    -      }
    -      // Remove those constraints that are already enforced by either the left or the right child
    -      val additionalConstraints = constraints -- (left.constraints ++ right.constraints)
    -      val newConditionOpt = conditionOpt match {
    -        case Some(condition) =>
    -          val newFilters = additionalConstraints -- splitConjunctivePredicates(condition)
    -          if (newFilters.nonEmpty) Option(And(newFilters.reduce(And), condition)) else None
    -        case None =>
    -          additionalConstraints.reduceOption(And)
    +      joinType match {
    +        // For inner join, we can infer additional filters for both sides. LeftSemi is kind of an
    +        // inner join, it just drops the right side in the final output.
    +        case _: InnerLike | LeftSemi =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newLeft = inferNewFilter(left, allConstraints)
    +          val newRight = inferNewFilter(right, allConstraints)
    +          join.copy(left = newLeft, right = newRight)
    +
    +        // For right outer join, we can only infer additional filters for left side.
    +        case RightOuter =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newLeft = inferNewFilter(left, allConstraints)
    +          join.copy(left = newLeft)
    +
    +        // For left join, we can only infer additional filters for right side.
    +        case LeftOuter | LeftAnti =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newRight = inferNewFilter(right, allConstraints)
    +          join.copy(right = newRight)
    +
    +        case _ => join
           }
    -      if (newConditionOpt.isDefined) Join(left, right, joinType, newConditionOpt) else join
    +  }
    +
    +  private def getAllConstraints(
    +      left: LogicalPlan,
    +      right: LogicalPlan,
    +      conditionOpt: Option[Expression]): Set[Expression] = {
    +    val baseConstraints = left.constraints.union(right.constraints)
    +      .union(conditionOpt.map(splitConjunctivePredicates).getOrElse(Nil).toSet)
    +    baseConstraints.union(ConstraintsUtils.inferAdditionalConstraints(baseConstraints))
    +  }
    +
    +  private def inferNewFilter(plan: LogicalPlan, constraints: Set[Expression]): LogicalPlan = {
    --- End diff --
    
    copying the plan is very cheap.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2572/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89431/testReport)** for PR 21083 at commit [`b967955`](https://github.com/apache/spark/commit/b967955ec2c7d33f28845dd55a1a9b70c5c2ba03).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89701/testReport)** for PR 21083 at commit [`787cddf`](https://github.com/apache/spark/commit/787cddffeba0f21cd40312bcbf84d1bb75126044).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait QueryPlanConstraints extends ConstraintHelper `
      * `trait ConstraintHelper `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89435/testReport)** for PR 21083 at commit [`561db44`](https://github.com/apache/spark/commit/561db44b6a42a4a58996e58dfd9555bf45006e9f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89432/testReport)** for PR 21083 at commit [`371c1df`](https://github.com/apache/spark/commit/371c1df8aed72ed42f74c8c4cdbc6c4c1791fcdf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21083: [SPARK-23564][SQL] infer additional filters from ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21083#discussion_r183299130
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -664,53 +662,52 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelpe
           }
     
         case join @ Join(left, right, joinType, conditionOpt) =>
    -      // Only consider constraints that can be pushed down completely to either the left or the
    -      // right child
    -      val constraints = join.allConstraints.filter { c =>
    -        c.references.subsetOf(left.outputSet) || c.references.subsetOf(right.outputSet)
    -      }
    -      // Remove those constraints that are already enforced by either the left or the right child
    -      val additionalConstraints = constraints -- (left.constraints ++ right.constraints)
    -      val newConditionOpt = conditionOpt match {
    -        case Some(condition) =>
    -          val newFilters = additionalConstraints -- splitConjunctivePredicates(condition)
    -          if (newFilters.nonEmpty) Option(And(newFilters.reduce(And), condition)) else conditionOpt
    -        case None =>
    -          additionalConstraints.reduceOption(And)
    -      }
    -      // Infer filter for left/right outer joins
    -      val newLeftOpt = joinType match {
    -        case RightOuter if newConditionOpt.isDefined =>
    -          val inferredConstraints = left.getRelevantConstraints(
    -            left.constraints
    -              .union(right.constraints)
    -              .union(splitConjunctivePredicates(newConditionOpt.get).toSet))
    -          val newFilters = inferredConstraints
    -            .filterNot(left.constraints.contains)
    -            .reduceLeftOption(And)
    -          newFilters.map(Filter(_, left))
    -        case _ => None
    -      }
    -      val newRightOpt = joinType match {
    -        case LeftOuter if newConditionOpt.isDefined =>
    -          val inferredConstraints = right.getRelevantConstraints(
    -            right.constraints
    -              .union(left.constraints)
    -              .union(splitConjunctivePredicates(newConditionOpt.get).toSet))
    -          val newFilters = inferredConstraints
    -            .filterNot(right.constraints.contains)
    -            .reduceLeftOption(And)
    -          newFilters.map(Filter(_, right))
    -        case _ => None
    -      }
    +      joinType match {
    +        // For inner join, we can infer additional filters for both sides. LeftSemi is kind of an
    +        // inner join, it just drops the right side in the final output.
    +        case _: InnerLike | LeftSemi =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newLeft = inferNewFilter(left, allConstraints)
    +          val newRight = inferNewFilter(right, allConstraints)
    +          join.copy(left = newLeft, right = newRight)
     
    -      if ((newConditionOpt.isDefined && (newConditionOpt ne conditionOpt))
    -        || newLeftOpt.isDefined || newRightOpt.isDefined) {
    -        Join(newLeftOpt.getOrElse(left), newRightOpt.getOrElse(right), joinType, newConditionOpt)
    -      } else {
    -        join
    +        // For right outer join, we can only infer additional filters for left side.
    +        case RightOuter =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newLeft = inferNewFilter(left, allConstraints)
    +          join.copy(left = newLeft)
    +
    +        // For left join, we can only infer additional filters for right side.
    +        case LeftOuter | LeftAnti =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newRight = inferNewFilter(right, allConstraints)
    +          join.copy(right = newRight)
    +
    +        case _ => join
           }
       }
    +
    +  private def getAllConstraints(
    +      left: LogicalPlan,
    +      right: LogicalPlan,
    +      conditionOpt: Option[Expression]): Set[Expression] = {
    +    val baseConstraints = left.constraints.union(right.constraints)
    +      .union(conditionOpt.map(splitConjunctivePredicates).getOrElse(Nil).toSet)
    +    baseConstraints.union(inferAdditionalConstraints(baseConstraints))
    +  }
    +
    +  private def inferNewFilter(plan: LogicalPlan, constraints: Set[Expression]): LogicalPlan = {
    +    val newPredicates = constraints
    +      .union(constructIsNotNullConstraints(constraints, plan.output))
    --- End diff --
    
    ```
    case _: InnerLike | LeftSemi =>
              val allConstraints = getAllConstraints(left, right, conditionOpt)
              val newLeft = inferNewFilter(left, allConstraints)
              val newRight = inferNewFilter(right, allConstraints)
              join.copy(left = newLeft, right = newRight)
    ```
    
    For this pattern, if we reuse the code you mentioned, we need to do constraints expanding twice, for left and right.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89705/testReport)** for PR 21083 at commit [`787cddf`](https://github.com/apache/spark/commit/787cddffeba0f21cd40312bcbf84d1bb75126044).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89435 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89435/testReport)** for PR 21083 at commit [`561db44`](https://github.com/apache/spark/commit/561db44b6a42a4a58996e58dfd9555bf45006e9f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2373/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2574/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2412/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89701/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89431/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    lgtm


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    It's a totally different approach to do additional filters inference from constraints for join, and I'm not sure if it works until I finished this PR. That's the main reason I didn't post my suggestion to the related PRs.
    
    I actually don't care about the credites and was planning to give credites to @mgaido91 since it's mostly inspired by the discussion with him. However someone mentioned your PR to me at the last minute, I pulled in your tests and found it's already covered. Then I was a little hesitate about who should get the credites and didn't mention it in the PR desription.
    
    If you don't mind, are you OK with giving credites to @mgaido91 ? Thanks for your understanding!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89480/testReport)** for PR 21083 at commit [`787cddf`](https://github.com/apache/spark/commit/787cddffeba0f21cd40312bcbf84d1bb75126044).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait QueryPlanConstraints extends ConstraintHelper `
      * `trait ConstraintHelper `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    cc @mgaido91 @maryannxue @KaiXinXiaoLei @gatorsmile @jiangxb1987 @gengliangwang 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    @cloud-fan the approach itself seems OK to me. Indeed, I prefer this one over the previous status where we had `constraints` enforced on the output and `allConstraints` containing also the others: I think it was pretty confusing and I think that moving those additional constraints to the optimizer is the right thing to do. Anyway, I will review it more carefully in the next days, asap.
    
    As far as the credit is regarded, there is no issue for me. You can give credit to @maryannxue and @KaiXinXiaoLei.
    
    I think I can close #20717 then, since it is going to be covered here, am I right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21083: [SPARK-23564][SQL] infer additional filters from ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21083


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    > That said, the InnerLike joins should already be covered by 1 and might not be worth being considered again in this optimization rule.
    
    Previously the `InferFiltersFromConstraints` adds the additional filters to join condition, so inner joins are covered by 1. Here I changed this rule to directly add additional filters to join children, so inner joins also need to be considered here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89435/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89701/testReport)** for PR 21083 at commit [`787cddf`](https://github.com/apache/spark/commit/787cddffeba0f21cd40312bcbf84d1bb75126044).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21083: [SPARK-21479][SPARK-23564][SQL] infer additional ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21083#discussion_r182034339
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -661,21 +661,51 @@ object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelpe
           }
     
         case join @ Join(left, right, joinType, conditionOpt) =>
    -      // Only consider constraints that can be pushed down completely to either the left or the
    -      // right child
    -      val constraints = join.allConstraints.filter { c =>
    -        c.references.subsetOf(left.outputSet) || c.references.subsetOf(right.outputSet)
    -      }
    -      // Remove those constraints that are already enforced by either the left or the right child
    -      val additionalConstraints = constraints -- (left.constraints ++ right.constraints)
    -      val newConditionOpt = conditionOpt match {
    -        case Some(condition) =>
    -          val newFilters = additionalConstraints -- splitConjunctivePredicates(condition)
    -          if (newFilters.nonEmpty) Option(And(newFilters.reduce(And), condition)) else None
    -        case None =>
    -          additionalConstraints.reduceOption(And)
    +      joinType match {
    +        // For inner join, we can infer additional filters for both sides. LeftSemi is kind of an
    +        // inner join, it just drops the right side in the final output.
    +        case _: InnerLike | LeftSemi =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newLeft = inferNewFilter(left, allConstraints)
    +          val newRight = inferNewFilter(right, allConstraints)
    +          join.copy(left = newLeft, right = newRight)
    +
    +        // For right outer join, we can only infer additional filters for left side.
    +        case RightOuter =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newLeft = inferNewFilter(left, allConstraints)
    +          join.copy(left = newLeft)
    +
    +        // For left join, we can only infer additional filters for right side.
    +        case LeftOuter | LeftAnti =>
    +          val allConstraints = getAllConstraints(left, right, conditionOpt)
    +          val newRight = inferNewFilter(right, allConstraints)
    +          join.copy(right = newRight)
    +
    +        case _ => join
           }
    -      if (newConditionOpt.isDefined) Join(left, right, joinType, newConditionOpt) else join
    +  }
    +
    +  private def getAllConstraints(
    +      left: LogicalPlan,
    +      right: LogicalPlan,
    +      conditionOpt: Option[Expression]): Set[Expression] = {
    +    val baseConstraints = left.constraints.union(right.constraints)
    +      .union(conditionOpt.map(splitConjunctivePredicates).getOrElse(Nil).toSet)
    +    baseConstraints.union(ConstraintsUtils.inferAdditionalConstraints(baseConstraints))
    +  }
    +
    +  private def inferNewFilter(plan: LogicalPlan, constraints: Set[Expression]): LogicalPlan = {
    --- End diff --
    
    may we return here the additional constraints instead of the new plan, so that in L669 and similar we can copy the plan only if it is needed?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-23564][SQL] infer additional filters from constra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89705/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21083: [SPARK-21479][SPARK-23564][SQL] infer additional ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21083#discussion_r182032397
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala ---
    @@ -54,13 +51,42 @@ trait QueryPlanConstraints { self: LogicalPlan =>
        * See [[Canonicalize]] for more details.
        */
       protected def validConstraints: Set[Expression] = Set.empty
    +}
    +
    +object ConstraintsUtils {
    --- End diff --
    
    nit: in order to follow the pattern I see also for `PredicateHelper`, what about having a `ConstraintHelper` trait instead of this object?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    **[Test build #89432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89432/testReport)** for PR 21083 at commit [`371c1df`](https://github.com/apache/spark/commit/371c1df8aed72ed42f74c8c4cdbc6c4c1791fcdf).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21083
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org