Posted to reviews@spark.apache.org by aokolnychyi <gi...@git.apache.org> on 2017/07/20 20:32:49 UTC

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Detect join conditions via fi...

GitHub user aokolnychyi opened a pull request:

    https://github.com/apache/spark/pull/18692

    [SPARK-21417][SQL] Detect join conditions via filter expressions

    ## What changes were proposed in this pull request?
    
    This PR adds an optimization rule that infers join conditions from the filter expressions specified in a query.
    
    For example, 
    `SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND t2.col2 = 1` 
    can be transformed into 
    `SELECT * FROM t1 JOIN t2 ON t1.col1 = t2.col2 WHERE t1.col1 = 1 AND t2.col2 = 1`.
    
    Refer to the corresponding ticket and tests for more details.
    
    ## How was this patch tested?
    
    This patch comes with a new test suite to cover the implemented logic.
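
The rewrite in the PR description can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the rule proposed in the PR: `InferJoinPredicatesSketch`, `EqualsLiteral`, and `EqualsColumn` are hypothetical names, and constraints are restricted to the `column = literal` form for brevity.

```scala
// Hypothetical, Spark-free model of the inference idea; not the PR's code.
object InferJoinPredicatesSketch {
  // A constraint of the form `column = literal` on one side of the join.
  final case class EqualsLiteral(column: String, literal: Int)

  // An inferred join predicate of the form `left = right`.
  final case class EqualsColumn(left: String, right: String)

  // Pair every left-side column with every right-side column
  // that is constrained to the same literal.
  def infer(left: Seq[EqualsLiteral], right: Seq[EqualsLiteral]): Seq[EqualsColumn] =
    for {
      l <- left
      r <- right
      if l.literal == r.literal
    } yield EqualsColumn(l.column, r.column)
}
```

With `t1.col1 = 1` on the left and `t2.col2 = 1` on the right, `infer` yields the single predicate `t1.col1 = t2.col2`; with different literals it yields nothing and the join would stay as it is.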


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aokolnychyi/spark spark-21417

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18692.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18692
    
----
commit e67d4d3c0bdf5cac4c6b17b50314984a2a6378d2
Author: aokolnychyi <an...@sap.com>
Date:   2017-07-18T18:49:16Z

    [SPARK-21417][SQL] Detect join conditions via filter expressions

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82892/


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    Yeah. That is a wrong case. Let us revisit it if we can find any useful case here. Thank you!


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by aokolnychyi <gi...@git.apache.org>.

Github user aokolnychyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r144722742
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that uses propagated constraints to infer join conditions. The optimization is applicable
    + * only to CROSS joins.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', then the rule infers 'a = b' as a join predicate.
    + */
    +object InferJoinConditionsFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = {
    +    if (SQLConf.get.constraintPropagationEnabled) {
    +      inferJoinConditions(plan)
    +    } else {
    +      plan
    +    }
    +  }
    +
    +  private def inferJoinConditions(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case join @ Join(left, right, Cross, conditionOpt) =>
    +      val leftConstraints = join.constraints.filter(_.references.subsetOf(left.outputSet))
    +      val rightConstraints = join.constraints.filter(_.references.subsetOf(right.outputSet))
    --- End diff --
    
    @gengliangwang Yeah, makes sense. So, ``PushPredicateThroughJoin`` would push the where clause into the join and the proposed rule will infer ``t1.col1 = t2.col1`` and change the join type to INNER. As a result, the final join condition will be ``t1.col1 = t2.col1 and t1.col1 >= t2.col1 and (t1.col1 = t1.col2 + t2.col2 and t2.col1 = t1.col2 + t2.col2)``. Am I right?


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by aokolnychyi <gi...@git.apache.org>.

Github user aokolnychyi commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    @gatorsmile what is our decision here? Shall we wait until SPARK-21652 is resolved? In the meantime, I can add some tests and see how the proposed rule works together with all others. 


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    **[Test build #84351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84351/testReport)** for PR 18692 at commit [`9ab91a1`](https://github.com/apache/spark/commit/9ab91a19cefd63b7d28674992b68da8164d487ae).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r153060560
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints.
    + *
    + * The optimization is applicable only to CROSS joins. For other join types, adding inferred join
    + * conditions would potentially shuffle children as child node's partitioning won't satisfy the JOIN
    + * node's requirements which otherwise could have.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', the rule infers 'a = b' as a join predicate.
    + */
    +object EliminateCrossJoin extends Rule[LogicalPlan] with PredicateHelper {
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = {
    +    if (SQLConf.get.constraintPropagationEnabled) {
    +      eliminateCrossJoin(plan)
    +    } else {
    +      plan
    +    }
    +  }
    +
    +  private def eliminateCrossJoin(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case join@Join(leftPlan, rightPlan, Cross, None) =>
    --- End diff --
    
    Nit: `join@Join` -> `join @ Join`


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    **[Test build #84351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84351/testReport)** for PR 18692 at commit [`9ab91a1`](https://github.com/apache/spark/commit/9ab91a19cefd63b7d28674992b68da8164d487ae).


---

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by aokolnychyi <gi...@git.apache.org>.

Github user aokolnychyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r152660385
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that uses propagated constraints to infer join conditions. The optimization is applicable
    + * only to CROSS joins.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', then the rule infers 'a = b' as a join predicate.
    + */
    +object InferJoinConditionsFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = {
    +    if (SQLConf.get.constraintPropagationEnabled) {
    +      inferJoinConditions(plan)
    +    } else {
    +      plan
    +    }
    +  }
    +
    +  private def inferJoinConditions(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case join @ Join(left, right, Cross, conditionOpt) =>
    +      val leftConstraints = join.constraints.filter(_.references.subsetOf(left.outputSet))
    +      val rightConstraints = join.constraints.filter(_.references.subsetOf(right.outputSet))
    +      val inferredJoinPredicates = inferJoinPredicates(leftConstraints, rightConstraints)
    +
    +      val newConditionOpt = conditionOpt match {
    +        case Some(condition) =>
    +          val existingPredicates = splitConjunctivePredicates(condition)
    +          val newPredicates = findNewPredicates(inferredJoinPredicates, existingPredicates)
    +          if (newPredicates.nonEmpty) Some(And(newPredicates.reduce(And), condition)) else None
    +        case None =>
    +          inferredJoinPredicates.reduceOption(And)
    +      }
    +      if (newConditionOpt.isDefined) Join(left, right, Inner, newConditionOpt) else join
    --- End diff --
    
    @gatorsmile Thanks for getting back.
    
    ``CheckCartesianProducts`` identifies a join of type ``Inner | LeftOuter | RightOuter | FullOuter`` as a cartesian product if there is no join predicate that has references to both relations.
    
    If we agree to ignore joins of type Cross that have a condition (in this PR), then the use case in this [discussion](https://github.com/apache/spark/pull/18692#discussion_r144466472) is no longer possible (even if you remove t1.col1 >= t2.col1). Correct? ``PushPredicateThroughJoin`` will push ``t1.col1 = t1.col2 + t2.col2 and t2.col1 = t1.col2 + t2.col2`` into the join condition and the proposed rule will not infer anything and the final join will be of type Cross with a condition that covers both relations. According to the logic of ``CheckCartesianProducts``, it is not considered to be a cartesian product (since there exists a join predicate that covers both relations, e.g. ``t1.col1 = t1.col2 + t2.col2``).
    
    So, if I have a confirmation that we need to consider only joins of type Cross and without any join conditions, I can update the PR accordingly.
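
The check described in this comment can be sketched roughly as follows. This is a simplified, Spark-free model and not the actual `CheckCartesianProducts` implementation: `CartesianCheckSketch` and `Predicate` are hypothetical names, a predicate is reduced to the set of columns it references, and the restriction to particular join types is left out.

```scala
// Hypothetical model of the "does any predicate span both sides?" check.
object CartesianCheckSketch {
  // A join predicate, modeled only by the columns it references.
  final case class Predicate(references: Set[String])

  // A join counts as a cartesian product when no predicate
  // references at least one column from each side.
  def isCartesianProduct(
      leftOutput: Set[String],
      rightOutput: Set[String],
      conditions: Seq[Predicate]): Boolean =
    !conditions.exists { p =>
      p.references.exists(leftOutput.contains) &&
        p.references.exists(rightOutput.contains)
    }
}
```

Under this model, a condition such as `t1.col1 = t1.col2 + t2.col2` references both relations, so the join is not reported as a cartesian product, which matches the reasoning in the comment.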


---

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r153060551
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints.
    + *
    + * The optimization is applicable only to CROSS joins. For other join types, adding inferred join
    + * conditions would potentially shuffle children as child node's partitioning won't satisfy the JOIN
    + * node's requirements which otherwise could have.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', the rule infers 'a = b' as a join predicate.
    --- End diff --
    
    > For instance, given a CROSS join with the constraint 'a = 1' from the left child and the constraint 'b = 1' from the right child, this rule infers a new join predicate 'a = b' and converts the join to an Inner join.


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Detect join conditions via filter ex...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    **[Test build #80056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80056/testReport)** for PR 18692 at commit [`915dc7e`](https://github.com/apache/spark/commit/915dc7ecb1891ce7387e49b8eab915049bd34f93).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r144466472
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that uses propagated constraints to infer join conditions. The optimization is applicable
    + * only to CROSS joins.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', then the rule infers 'a = b' as a join predicate.
    + */
    +object InferJoinConditionsFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = {
    +    if (SQLConf.get.constraintPropagationEnabled) {
    +      inferJoinConditions(plan)
    +    } else {
    +      plan
    +    }
    +  }
    +
    +  private def inferJoinConditions(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case join @ Join(left, right, Cross, conditionOpt) =>
    +      val leftConstraints = join.constraints.filter(_.references.subsetOf(left.outputSet))
    +      val rightConstraints = join.constraints.filter(_.references.subsetOf(right.outputSet))
    --- End diff --
    
    I don't think we need to separate the constraints as left only and right only.
    The following case can infer `t1.col1 = t2.col1`:
    ```scala
    Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
    Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t2")
    val df = spark.sql("SELECT * FROM t1 CROSS JOIN t2 ON t1.col1 >= t2.col1 " +
       "WHERE t1.col1 = t1.col2 + t2.col2 and t2.col1 = t1.col2 + t2.col2")
    ```
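
The inference in this example relies on two columns being equated to the same expression rather than to a constant. A minimal sketch of that idea, assuming expressions are compared as opaque text (a real optimizer would compare canonicalized expression trees); `CommonExpressionSketch` and `Equality` are hypothetical names:

```scala
// Hypothetical model: columns equated to the same expression are equal to each other.
object CommonExpressionSketch {
  // An equality constraint `column = expression`.
  final case class Equality(column: String, expression: String)

  // Group constraints by their right-hand expression and emit
  // one (columnA, columnB) pair for every two columns in a group.
  def inferColumnEqualities(constraints: Seq[Equality]): Seq[(String, String)] =
    constraints
      .groupBy(_.expression)
      .values
      .flatMap { group =>
        group.map(_.column).distinct.sorted.combinations(2).map {
          case Seq(a, b) => (a, b)
        }
      }
      .toSeq
      .sorted
}
```

For the two `WHERE` constraints in the query above, both columns share the expression `t1.col2 + t2.col2`, so the sketch returns the pair `(t1.col1, t2.col1)`.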



---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81390/


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    In this PR, we should limit it to `cartesian product` for now. In the future, we need to be smarter when extracting equi-join keys.


---

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r152725251
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that uses propagated constraints to infer join conditions. The optimization is applicable
    + * only to CROSS joins.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', then the rule infers 'a = b' as a join predicate.
    + */
    +object InferJoinConditionsFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = {
    +    if (SQLConf.get.constraintPropagationEnabled) {
    +      inferJoinConditions(plan)
    +    } else {
    +      plan
    +    }
    +  }
    +
    +  private def inferJoinConditions(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case join @ Join(left, right, Cross, conditionOpt) =>
    +      val leftConstraints = join.constraints.filter(_.references.subsetOf(left.outputSet))
    +      val rightConstraints = join.constraints.filter(_.references.subsetOf(right.outputSet))
    +      val inferredJoinPredicates = inferJoinPredicates(leftConstraints, rightConstraints)
    +
    +      val newConditionOpt = conditionOpt match {
    +        case Some(condition) =>
    +          val existingPredicates = splitConjunctivePredicates(condition)
    +          val newPredicates = findNewPredicates(inferredJoinPredicates, existingPredicates)
    +          if (newPredicates.nonEmpty) Some(And(newPredicates.reduce(And), condition)) else None
    +        case None =>
    +          inferredJoinPredicates.reduceOption(And)
    +      }
    +      if (newConditionOpt.isDefined) Join(left, right, Inner, newConditionOpt) else join
    --- End diff --
    
    Yes. In this PR, we just need to consider cross join without any join condition. 
    
    In the future, we can extend it. 


---

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by aokolnychyi <gi...@git.apache.org>.

Github user aokolnychyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r137343433
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that uses propagated constraints to infer join conditions. The optimization is applicable
    + * only to CROSS joins.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', then the rule infers 'a = b' as a join predicate.
    + */
    +object InferJoinConditionsFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
    --- End diff --
    
    I also thought about this but `InferFiltersFromConstraints` does not change considered join types. Therefore, I kept them separated. In addition, I thought about renaming it to `EliminateCrossJoin`.


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by SimonBin <gi...@git.apache.org>.

Github user SimonBin commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    @aokolnychyi thank you for the clarification, I see now


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Detect join conditions via filter ex...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    I think we already did it via constraint propagation, didn't we?


---

[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r152412423
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that uses propagated constraints to infer join conditions. The optimization is applicable
    + * only to CROSS joins.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', then the rule infers 'a = b' as a join predicate.
    + */
    +object InferJoinConditionsFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
    --- End diff --
    
    Yes. Since we decided to focus on cross joins only, we should rename it to `EliminateCrossJoin`, as you proposed.


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Detect join conditions via filter ex...

Posted by aokolnychyi <gi...@git.apache.org>.

Github user aokolnychyi commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    @gatorsmile I took a look at the case above. Indeed, the proposed rule triggers this issue, but only indirectly. In the example above, the optimizer never reaches a fixed point. Please find my investigation below.
    
    ```
    ... 
    
    // The new rule infers correct join predicates
    Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
    :- Filter ((col1#32 = col2#33) && (col1#32 = 1))
    :  +- Relation[col1#32,col2#33] parquet
    +- Filter (col#34 = 1)
       +- Relation[col#34] parquet
    
    // InferFiltersFromConstraints adds more filters
    Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
    :- Filter ((((col2#33 = 1) && isnotnull(col1#32)) && isnotnull(col2#33)) && ((col1#32 = col2#33) && (col1#32 = 1)))
    :  +- Relation[col1#32,col2#33] parquet
    +- Filter (isnotnull(col#34) && (col#34 = 1))
       +- Relation[col#34] parquet
    
    // ConstantPropagation is applied
    Join Inner, ((col2#33 = col#34) && (col1#32 = col#34))
    !:- Filter (((((col2#33 = 1) && isnotnull(col2#33)) && isnotnull(col1#32)) && ((1 = col2#33) && (col1#32 = 1))) 
     :  +- Relation[col1#32,col2#33] parquet
     +- Filter (isnotnull(col#34) && (col#34 = 1))
        +- Relation[col#34] parquet                          
    
    // (Important) InferFiltersFromConstraints infers (col1#32 = col2#33), which is added to the join condition.
    Join Inner, ((col1#32 = col2#33) && ((col2#33 = col#34) && (col1#32 = col#34)))
    !:- Filter (((((col2#33 = 1) && isnotnull(col2#33)) && isnotnull(col1#32)) && ((1 = col2#33) && (col1#32 = 1))) 
     :  +- Relation[col1#32,col2#33] parquet
     +- Filter (isnotnull(col#34) && (col#34 = 1))
        +- Relation[col#34] parquet
    
     // PushPredicateThroughJoin pushes down (col1#32 = col2#33) and then CombineFilters produces
    Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
    !:- Filter ((((isnotnull(col1#32) && (col2#33 = 1)) && isnotnull(col2#33)) && ((1 = col2#33) && (col1#32 = 1))) && (col2#33 = col1#32))
     :  +- Relation[col1#32,col2#33] parquet
     +- Filter (isnotnull(col#34) && (col#34 = 1))
        +- Relation[col#34] parquet                                                                      
    
    ```
    After that, `ConstantPropagation` replaces `(col2#33 = col1#32)` with `(1 = 1)`, `BooleanSimplification` removes `(1 = 1)`, `InferFiltersFromConstraints` infers `(col2#33 = col1#32)` again, and the procedure repeats forever. Since `InferFiltersFromConstraints` is the last optimization rule, we end up with the redundant condition you mentioned. Even without the new rule, the Optimizer does not converge on the following query:
    
    ```scala
    Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
    Seq(1, 2).toDF("col").write.saveAsTable("t2")
    spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
    ```
    Correct me if I am wrong, but it seems like an issue with the existing rules.
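
The oscillation described in this comment can be reproduced in a toy model. The sketch below is an assumption-heavy abstraction and not Spark's rule executor: the whole plan is reduced to where the redundant predicate `col1 = col2` currently lives, and `simplify`, `pushDown`, and `infer` are hypothetical stand-ins for `ConstantPropagation` plus `BooleanSimplification`, `PushPredicateThroughJoin`, and `InferFiltersFromConstraints`.

```scala
// Hypothetical toy model of a rule batch that never reaches a fixed point.
object FixedPointSketch {
  // Where the redundant predicate `col1 = col2` lives in the toy plan.
  sealed trait Plan
  case object Absent extends Plan
  case object InJoinCondition extends Plan
  case object InFilter extends Plan

  type Rule = Plan => Plan

  // Rewrites the predicate in a filter to `1 = 1` and drops it.
  val simplify: Rule = {
    case InFilter => Absent
    case other => other
  }

  // Moves the predicate from the join condition into a child filter.
  val pushDown: Rule = {
    case InJoinCondition => InFilter
    case other => other
  }

  // Re-derives the predicate whenever it is missing.
  val infer: Rule = {
    case Absent => InJoinCondition
    case other => other
  }

  // Runs the batch until the plan stops changing. Returns the iteration at
  // which the fixed point was detected, or None if maxIterations is exceeded.
  def runToFixedPoint(start: Plan, batch: Seq[Rule], maxIterations: Int): Option[Int] = {
    @annotation.tailrec
    def loop(plan: Plan, iteration: Int): Option[Int] =
      if (iteration > maxIterations) None
      else {
        val next = batch.foldLeft(plan)((p, rule) => rule(p))
        if (next == plan) Some(iteration) else loop(next, iteration + 1)
      }
    loop(start, 1)
  }
}
```

With the batch `Seq(simplify, pushDown, infer)`, the toy plan alternates between `InJoinCondition` and `InFilter` at the end of successive iterations, so `runToFixedPoint` gives up; dropping `infer` from the batch lets it converge.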


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by aokolnychyi <gi...@git.apache.org>.

Github user aokolnychyi commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    Yeah, correct. So, we should revert then.


---

[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

Posted by aokolnychyi <gi...@git.apache.org>.

Github user aokolnychyi commented on the issue:

    https://github.com/apache/spark/pull/18692
  
    I am not sure we can infer ``a == b`` if ``a in (0, 2, 3, 4)`` and ``b in (0, 2, 3, 4)``. 
    
    table 'a'
    ```
    a1 a2
    1  2
    3  3
    4  5
    ```
    
    table 'b'
    ```
    b1 b2
    1  -1
    2  -2
    3  -4
    ```
    
    ```
    SELECT * FROM a, b WHERE a1 in (1, 2) AND b1 in (1, 2)
    // 1 2 1 -1
    // 1 2 2 -2
    ```
    ```
    SELECT * FROM a JOIN b ON a.a1 = b.b1 WHERE a1 in (1, 2) AND b1 in (1, 2)
    // 1 2 1 -1
    ```
    
    Am I missing anything?
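
The counterexample above can be checked without Spark. The following is a minimal sketch using plain Scala collections; the case classes `A` and `B` and all value names are illustrative assumptions, not part of the PR. It applies the two `IN` filters to the cross product and then adds the condition `a1 = b1` that the rule would have inferred, which drops a row.

```scala
// Toy reproduction of the counterexample with plain Scala collections (no Spark).
// The case classes and value names are hypothetical, chosen only for illustration.
object InPredicateCounterexample {
  final case class A(a1: Int, a2: Int)
  final case class B(b1: Int, b2: Int)

  val tableA: Seq[A] = Seq(A(1, 2), A(3, 3), A(4, 5))
  val tableB: Seq[B] = Seq(B(1, -1), B(2, -2), B(3, -4))
  val allowed: Set[Int] = Set(1, 2)

  // SELECT * FROM a, b WHERE a1 IN (1, 2) AND b1 IN (1, 2)
  val crossJoin: Seq[(A, B)] =
    for {
      a <- tableA if allowed(a.a1)
      b <- tableB if allowed(b.b1)
    } yield (a, b)

  // The same query with the (invalid) inferred condition a1 = b1 added.
  val withInferredCondition: Seq[(A, B)] =
    crossJoin.filter { case (a, b) => a.a1 == b.b1 }

  def main(args: Array[String]): Unit = {
    println(s"cross join rows: ${crossJoin.size}")
    println(s"rows with a1 = b1: ${withInferredCondition.size}")
  }
}
```

The two result sets differ in size, so an `IN` predicate shared by both sides does not by itself justify an equality between the columns; only a single-value constraint such as `a = 1` and `b = 1` does.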




[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...

Posted by aokolnychyi <gi...@git.apache.org>.

Github user aokolnychyi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18692#discussion_r153066992
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints.
    + *
    + * The optimization is applicable only to CROSS joins. For other join types, adding inferred join
    + * conditions would potentially shuffle children as child node's partitioning won't satisfy the JOIN
    + * node's requirements which otherwise could have.
    + *
    + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
    + * relation has 'b = 1', the rule infers 'a = b' as a join predicate.
    + */
    +object EliminateCrossJoin extends Rule[LogicalPlan] with PredicateHelper {
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = {
    +    if (SQLConf.get.constraintPropagationEnabled) {
    +      eliminateCrossJoin(plan)
    +    } else {
    +      plan
    +    }
    +  }
    +
    +  private def eliminateCrossJoin(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case join@Join(leftPlan, rightPlan, Cross, None) =>
    +      val leftConstraints = join.constraints.filter(_.references.subsetOf(leftPlan.outputSet))
    +      val rightConstraints = join.constraints.filter(_.references.subsetOf(rightPlan.outputSet))
    +      val inferredJoinPredicates = inferJoinPredicates(leftConstraints, rightConstraints)
    +      val joinConditionOpt = inferredJoinPredicates.reduceOption(And)
    +      if (joinConditionOpt.isDefined) Join(leftPlan, rightPlan, Inner, joinConditionOpt) else join
    +  }
    +
    +  private def inferJoinPredicates(
    +      leftConstraints: Set[Expression],
    +      rightConstraints: Set[Expression]): Set[EqualTo] = {
    +
    +    // iterate through the left constraints and build a hash map that points semantically
    +    // equivalent expressions into attributes
    +    val emptyEquivalenceMap = Map.empty[SemanticExpression, Set[Attribute]]
    +    val equivalenceMap = leftConstraints.foldLeft(emptyEquivalenceMap) { case (map, constraint) =>
    +      constraint match {
    +        case EqualTo(attr: Attribute, expr: Expression) =>
    +          updateEquivalenceMap(map, attr, expr)
    +        case EqualTo(expr: Expression, attr: Attribute) =>
    +          updateEquivalenceMap(map, attr, expr)
    +        case _ => map
    +      }
    +    }
    +
    +    // iterate through the right constraints and infer join conditions using the equivalence map
    +    rightConstraints.foldLeft(Set.empty[EqualTo]) { case (joinConditions, constraint) =>
    +      constraint match {
    +        case EqualTo(attr: Attribute, expr: Expression) =>
    +          appendJoinConditions(attr, expr, equivalenceMap, joinConditions)
    +        case EqualTo(expr: Expression, attr: Attribute) =>
    +          appendJoinConditions(attr, expr, equivalenceMap, joinConditions)
    +        case _ => joinConditions
    +      }
    +    }
    +  }
    +
    +  private def updateEquivalenceMap(
    +      equivalenceMap: Map[SemanticExpression, Set[Attribute]],
    +      attr: Attribute,
    +      expr: Expression): Map[SemanticExpression, Set[Attribute]] = {
    +
    +    val equivalentAttrs = equivalenceMap.getOrElse(expr, Set.empty[Attribute])
    +    if (equivalentAttrs.contains(attr)) {
    +      equivalenceMap
    +    } else {
    +      equivalenceMap.updated(expr, equivalentAttrs + attr)
    +    }
    +  }
    +
    +  private def appendJoinConditions(
    +      attr: Attribute,
    +      expr: Expression,
    +      equivalenceMap: Map[SemanticExpression, Set[Attribute]],
    +      joinConditions: Set[EqualTo]): Set[EqualTo] = {
    +
    +    equivalenceMap.get(expr) match {
    +      case Some(equivalentAttrs) => joinConditions ++ equivalentAttrs.map(EqualTo(attr, _))
    +      case None => joinConditions
    +    }
    +  }
    +
    +  // the purpose of this class is to treat 'a === 1 and 1 === 'a as the same expressions
    +  implicit class SemanticExpression(private val expr: Expression) {
    --- End diff --
    
    @gatorsmile 
    
    I think we just need the case class inside ``EquivalentExpressions`` since we have to map all semantically equivalent expressions into a set of attributes (as opposed to mapping an expression into a set of equivalent expressions). 
    
    I see two ways to go:
    
    1. Expose the case class inside ``EquivalentExpressions`` with minimal changes to the code base (e.g., using a companion object):
    
    ````
    object EquivalentExpressions {
    
      /**
       * Wrapper around an Expression that provides semantic equality.
       */
      implicit class SemanticExpr(private val e: Expression) {
        override def equals(o: Any): Boolean = o match {
          case other: SemanticExpr => e.semanticEquals(other.e)
          case _ => false
        }
    
        override def hashCode: Int = e.semanticHash()
      }
    }
    ````
    
    2. Keep ``EquivalentExpressions`` as it is and maintain a separate map from expressions to attributes in the proposed rule.
    
    Personally, I lean toward the first idea since ``SemanticExpr`` might be useful on its own. However, there may be drawbacks that have not occurred to me.
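
To illustrate why a wrapper with semantic equality is enough for a map from expressions to attributes, here is a self-contained sketch. It does not use Catalyst: `Expr`, `Attr`, `Lit`, `Add`, `canonical`, `SemanticKey` and `addEquivalence` are all hypothetical stand-ins, and the canonical form (ordering the operands of `Add` by their string form) is a simplification assumed only for this example.

```scala
// A toy model of a semantic-equality wrapper used as a map key.
// None of these types are Spark classes; they are hypothetical stand-ins.
object SemanticKeySketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Toy canonical form: sort the operands of the commutative Add by their
  // string representation so that operand order no longer matters.
  def canonical(e: Expr): Expr = e match {
    case Add(l, r) =>
      val (cl, cr) = (canonical(l), canonical(r))
      if (cl.toString <= cr.toString) Add(cl, cr) else Add(cr, cl)
    case other => other
  }

  // Wrapper whose equals/hashCode are defined on the canonical form.
  final class SemanticKey(val e: Expr) {
    private val canon = canonical(e)
    override def equals(o: Any): Boolean = o match {
      case other: SemanticKey => canon == other.canon
      case _ => false
    }
    override def hashCode: Int = canon.hashCode
  }

  // Record that `attr` is equal to `expr`, merging semantically equal keys.
  def addEquivalence(
      map: Map[SemanticKey, Set[Attr]],
      attr: Attr,
      expr: Expr): Map[SemanticKey, Set[Attr]] = {
    val key = new SemanticKey(expr)
    map.updated(key, map.getOrElse(key, Set.empty[Attr]) + attr)
  }

  def main(args: Array[String]): Unit = {
    val m1 = addEquivalence(Map.empty, Attr("x"), Add(Lit(1), Lit(2)))
    val m2 = addEquivalence(m1, Attr("y"), Add(Lit(2), Lit(1)))
    // Both constraints land under one key, so x and y end up in the same set.
    println(m2.values.map(_.map(_.name)).mkString(", "))
  }
}
```

With plain structural equality, `Add(Lit(1), Lit(2))` and `Add(Lit(2), Lit(1))` would occupy two separate entries and no relation between `x` and `y` would be found; the wrapper collapses them into one.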

