You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/03/14 08:22:47 UTC

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product resu...

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/17286

    [SPARK-19915][SQL] Exclude cartesian product results in join reorder

    ## What changes were proposed in this pull request?
    
    Exclude cartesian products in the "memo". This significantly reduces the search space and memory overhead of memo. Otherwise every combination of items will exist in the memo. We can find those unjoinable items after reordering is finished and put them at the end.
    
    This pr also includes a bug fix: if a leaf item is a project(_, child), current solution will miss the project.
    
    ## How was this patch tested?
    
    Add a test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark joinReorder3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17286.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17286
    
----
commit f472c74189f5f54c701876aaa73f7d331e5b814d
Author: wangzhenhua <wa...@huawei.com>
Date:   2017-03-14T08:15:45Z

    improve join reorder

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by nsyca <gi...@git.apache.org>.

Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Right. I misread it. if there is no join predicate between a table and any cluster of tables, we should not consider that table in the join enumeration at all. We can simply push that table to be the last join.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106582845
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -272,26 +256,39 @@ object JoinReorderDP extends PredicateHelper {
        * @param itemIds Set of item ids participating in this partial plan.
        * @param plan The plan tree with the lowest cost for these items found so far.
        * @param joinConds Join conditions included in the plan.
    -   * @param cost The cost of this plan is the sum of costs of all intermediate joins.
    +   * @param planCost The cost of this plan tree is the sum of costs of all intermediate joins.
    --- End diff --
    
    I think `cost` is good enough, why rename it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74527/testReport)** for PR 17286 at commit [`9aee29a`](https://github.com/apache/spark/commit/9aee29a1d50c3a0a1b332f00ca5b41d5f0a2d877).
     * This patch **fails SparkR unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Given {A, B, C} with the join conditions {A.a = B.b, B.b > C.c}. `{A, C}` should be pruned. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74656/testReport)** for PR 17286 at commit [`237ad9d`](https://github.com/apache/spark/commit/237ad9d1575b3b08a60e6f0e0160ab9636b1453c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    @ioana-delaney How can we get disconnected parts before reordering? They can not only be leaf tables (this case is easier to deal with), but also subplans (tables can be joined internally but among subplans there's no join condition). We need to consider both cases and they may co-exist. Also, some disconnection can be eliminated by reordering.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106338345
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -128,38 +128,43 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
     object JoinReorderDP extends PredicateHelper {
     
       def search(
    -      conf: CatalystConf,
    +      conf: SQLConf,
           items: Seq[LogicalPlan],
           conditions: Set[Expression],
           topOutput: AttributeSet): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    +    })
     
    -    for (lev <- 1 until items.length) {
    +    // Build plans for next levels until the last level has only one plan. This plan contains
    +    // all items that can be joined, so there's no need to continue.
    +    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
           // Build plans for the next level.
           foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
         }
     
    -    val plansLastLevel = foundPlans(items.length - 1)
    -    if (plansLastLevel.isEmpty) {
    -      // Failed to find a plan, fall back to the original plan
    -      None
    -    } else {
    -      // There must be only one plan at the last level, which contains all items.
    -      assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == items.length)
    -      Some(plansLastLevel.head._2.plan)
    +    // Find the best plan
    +    assert(foundPlans.last.size <= 1)
    --- End diff --
    
    how about
    ```
    while (foundPlans.size < items.length && foundPlans.last.size > 0)
    ```
    When we end the while loop, either we have reached the level n, or the current level has 0 entries. Then we pick the last level which has non-zero entries, and pick the best entry from this level, and construct the final join plan.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74527/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106582563
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -128,38 +131,34 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
     object JoinReorderDP extends PredicateHelper {
     
       def search(
    -      conf: CatalystConf,
    +      conf: SQLConf,
           items: Seq[LogicalPlan],
           conditions: Set[Expression],
    -      topOutput: AttributeSet): Option[LogicalPlan] = {
    +      topOutput: AttributeSet): LogicalPlan = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
    --- End diff --
    
    looks like an unnecessary change now


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74741/testReport)** for PR 17286 at commit [`aa75b6e`](https://github.com/apache/spark/commit/aa75b6edc0803b537c46a730a1cc9e2b8ac424a4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17286


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74737/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106179640
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -128,38 +128,43 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
     object JoinReorderDP extends PredicateHelper {
     
       def search(
    -      conf: CatalystConf,
    +      conf: SQLConf,
           items: Seq[LogicalPlan],
           conditions: Set[Expression],
           topOutput: AttributeSet): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    +    })
     
    -    for (lev <- 1 until items.length) {
    +    // Build plans for next levels until the last level has only one plan. This plan contains
    +    // all items that can be joined, so there's no need to continue.
    +    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
           // Build plans for the next level.
           foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
         }
     
    -    val plansLastLevel = foundPlans(items.length - 1)
    -    if (plansLastLevel.isEmpty) {
    -      // Failed to find a plan, fall back to the original plan
    -      None
    -    } else {
    -      // There must be only one plan at the last level, which contains all items.
    -      assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == items.length)
    -      Some(plansLastLevel.head._2.plan)
    +    // Find the best plan
    +    assert(foundPlans.last.size <= 1)
    --- End diff --
    
    hmm you have a good point here. If we have several disconnect item sets, e.g. {AB} and {CD}, or a more complex case: {ABCD}, {EFG}, {LM}... These cases need to be dealt with.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106178276
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -19,31 +19,31 @@ package org.apache.spark.sql.catalyst.optimizer
     
     import scala.collection.mutable
     
    -import org.apache.spark.sql.catalyst.CatalystConf
     import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeSet, Expression, PredicateHelper}
     import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
     import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, LogicalPlan, Project}
     import org.apache.spark.sql.catalyst.rules.Rule
    +import org.apache.spark.sql.internal.SQLConf
     
     
     /**
      * Cost-based join reorder.
      * We may have several join reorder algorithms in the future. This class is the entry of these
      * algorithms, and chooses which one to use.
      */
    -case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper {
    +case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with PredicateHelper {
       def apply(plan: LogicalPlan): LogicalPlan = {
         if (!conf.cboEnabled || !conf.joinReorderEnabled) {
           plan
         } else {
    -      val result = plan transform {
    +      val result = plan transformDown {
             case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
    --- End diff --
    
    Yea I think so.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106357158
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -696,6 +696,13 @@ object SQLConf {
           .intConf
           .createWithDefault(12)
     
    +  val JOIN_REORDER_CARD_WEIGHT =
    +    buildConf("spark.sql.cbo.joinReorder.card.weight")
    +      .doc("The weight of cardinality (number of rows) for plan cost comparison in join reorder: " +
    +        "rows * weight + size * (1 - weight).")
    +      .doubleConf
    +      .createWithDefault(0.7)
    --- End diff --
    
    Yes, checking should be added, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74528/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    LGTM except some minor comments


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106179885
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -128,38 +128,43 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
     object JoinReorderDP extends PredicateHelper {
     
       def search(
    -      conf: CatalystConf,
    +      conf: SQLConf,
           items: Seq[LogicalPlan],
           conditions: Set[Expression],
           topOutput: AttributeSet): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    +    })
     
    -    for (lev <- 1 until items.length) {
    +    // Build plans for next levels until the last level has only one plan. This plan contains
    +    // all items that can be joined, so there's no need to continue.
    +    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
           // Build plans for the next level.
           foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
         }
     
    -    val plansLastLevel = foundPlans(items.length - 1)
    -    if (plansLastLevel.isEmpty) {
    -      // Failed to find a plan, fall back to the original plan
    -      None
    -    } else {
    -      // There must be only one plan at the last level, which contains all items.
    -      assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == items.length)
    -      Some(plansLastLevel.head._2.plan)
    +    // Find the best plan
    +    assert(foundPlans.last.size <= 1)
    --- End diff --
    
    Seems excluding CP bring some complexity here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74737 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74737/testReport)** for PR 17286 at commit [`aa75b6e`](https://github.com/apache/spark/commit/aa75b6edc0803b537c46a730a1cc9e2b8ac424a4).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74527/testReport)** for PR 17286 at commit [`9aee29a`](https://github.com/apache/spark/commit/9aee29a1d50c3a0a1b332f00ca5b41d5f0a2d877).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74528/testReport)** for PR 17286 at commit [`dfc34fc`](https://github.com/apache/spark/commit/dfc34fccae97fa09b382dd215d09173ebf70b2de).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106355108
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -128,38 +128,43 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
     object JoinReorderDP extends PredicateHelper {
     
       def search(
    -      conf: CatalystConf,
    +      conf: SQLConf,
           items: Seq[LogicalPlan],
           conditions: Set[Expression],
           topOutput: AttributeSet): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    +    })
     
    -    for (lev <- 1 until items.length) {
    +    // Build plans for next levels until the last level has only one plan. This plan contains
    +    // all items that can be joined, so there's no need to continue.
    +    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
           // Build plans for the next level.
           foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
         }
     
    -    val plansLastLevel = foundPlans(items.length - 1)
    -    if (plansLastLevel.isEmpty) {
    -      // Failed to find a plan, fall back to the original plan
    -      None
    -    } else {
    -      // There must be only one plan at the last level, which contains all items.
    -      assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == items.length)
    -      Some(plansLastLevel.head._2.plan)
    +    // Find the best plan
    +    assert(foundPlans.last.size <= 1)
    --- End diff --
    
    We still cannot deal with the case where several disconnected groups of items exist in the memo. 
    Join relations may exist within these groups. E.g.
    ```
    level 3: {ABCD}
    level 2: {EFG}, {HIJ}, {ABC}...
    level 1: {KL}, {MN}, {AB}, {BC}...
    level 0: ...
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106094256
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -128,38 +128,43 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
     object JoinReorderDP extends PredicateHelper {
     
       def search(
    -      conf: CatalystConf,
    +      conf: SQLConf,
           items: Seq[LogicalPlan],
           conditions: Set[Expression],
           topOutput: AttributeSet): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    +    })
     
    -    for (lev <- 1 until items.length) {
    +    // Build plans for next levels until the last level has only one plan. This plan contains
    +    // all items that can be joined, so there's no need to continue.
    +    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
           // Build plans for the next level.
           foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
         }
     
    -    val plansLastLevel = foundPlans(items.length - 1)
    -    if (plansLastLevel.isEmpty) {
    -      // Failed to find a plan, fall back to the original plan
    -      None
    -    } else {
    -      // There must be only one plan at the last level, which contains all items.
    -      assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == items.length)
    -      Some(plansLastLevel.head._2.plan)
    +    // Find the best plan
    +    assert(foundPlans.last.size <= 1)
    --- End diff --
    
    can you answer this question: https://github.com/apache/spark/pull/17240#discussion_r105822819


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    My example is not related to inequality join or equi join.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106094103
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -19,31 +19,31 @@ package org.apache.spark.sql.catalyst.optimizer
     
     import scala.collection.mutable
     
    -import org.apache.spark.sql.catalyst.CatalystConf
     import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeSet, Expression, PredicateHelper}
     import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
     import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, LogicalPlan, Project}
     import org.apache.spark.sql.catalyst.rules.Rule
    +import org.apache.spark.sql.internal.SQLConf
     
     
     /**
      * Cost-based join reorder.
      * We may have several join reorder algorithms in the future. This class is the entry of these
      * algorithms, and chooses which one to use.
      */
    -case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper {
    +case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with PredicateHelper {
       def apply(plan: LogicalPlan): LogicalPlan = {
         if (!conf.cboEnabled || !conf.joinReorderEnabled) {
           plan
         } else {
    -      val result = plan transform {
    +      val result = plan transformDown {
             case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
    --- End diff --
    
    shall we also check if `projectList` are all attributes here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74737 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74737/testReport)** for PR 17286 at commit [`aa75b6e`](https://github.com/apache/spark/commit/aa75b6edc0803b537c46a730a1cc9e2b8ac424a4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106602975
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -710,6 +710,14 @@ object SQLConf {
           .intConf
           .createWithDefault(12)
     
    +  val JOIN_REORDER_CARD_WEIGHT =
    +    buildConf("spark.sql.cbo.joinReorder.card.weight")
    +      .doc("The weight of cardinality (number of rows) for plan cost comparison in join reorder: " +
    +        "rows * weight + size * (1 - weight).")
    +      .doubleConf
    +      .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
    +      .createWithDefault(0.7)
    --- End diff --
    
    `0.7` is just an empirical value, I think it's better to be able to tune it. Maybe mark it internal?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106583013
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -203,64 +205,46 @@ object JoinReorderDP extends PredicateHelper {
       private def buildJoin(
           oneJoinPlan: JoinPlan,
           otherJoinPlan: JoinPlan,
    -      conf: CatalystConf,
    +      conf: SQLConf,
           conditions: Set[Expression],
    -      topOutput: AttributeSet): JoinPlan = {
    +      topOutput: AttributeSet): Option[JoinPlan] = {
     
         val onePlan = oneJoinPlan.plan
         val otherPlan = otherJoinPlan.plan
    -    // Now both onePlan and otherPlan become intermediate joins, so the cost of the
    -    // new join should also include their own cardinalities and sizes.
    -    val newCost = if (isCartesianProduct(onePlan) || isCartesianProduct(otherPlan)) {
    -      // We consider cartesian product very expensive, thus set a very large cost for it.
    -      // This enables to plan all the cartesian products at the end, because having a cartesian
    -      // product as an intermediate join will significantly increase a plan's cost, making it
    -      // impossible to be selected as the best plan for the items, unless there's no other choice.
    -      Cost(
    -        rows = BigInt(Long.MaxValue) * BigInt(Long.MaxValue),
    -        size = BigInt(Long.MaxValue) * BigInt(Long.MaxValue))
    -    } else {
    -      val onePlanStats = onePlan.stats(conf)
    -      val otherPlanStats = otherPlan.stats(conf)
    -      Cost(
    -        rows = oneJoinPlan.cost.rows + onePlanStats.rowCount.get +
    -          otherJoinPlan.cost.rows + otherPlanStats.rowCount.get,
    -        size = oneJoinPlan.cost.size + onePlanStats.sizeInBytes +
    -          otherJoinPlan.cost.size + otherPlanStats.sizeInBytes)
    -    }
    -
    -    // Put the deeper side on the left, tend to build a left-deep tree.
    -    val (left, right) = if (oneJoinPlan.itemIds.size >= otherJoinPlan.itemIds.size) {
    -      (onePlan, otherPlan)
    -    } else {
    -      (otherPlan, onePlan)
    -    }
         val joinConds = conditions
           .filterNot(l => canEvaluate(l, onePlan))
           .filterNot(r => canEvaluate(r, otherPlan))
           .filter(e => e.references.subsetOf(onePlan.outputSet ++ otherPlan.outputSet))
    -    // We use inner join whether join condition is empty or not. Since cross join is
    -    // equivalent to inner join without condition.
    -    val newJoin = Join(left, right, Inner, joinConds.reduceOption(And))
    -    val collectedJoinConds = joinConds ++ oneJoinPlan.joinConds ++ otherJoinPlan.joinConds
    -    val remainingConds = conditions -- collectedJoinConds
    -    val neededAttr = AttributeSet(remainingConds.flatMap(_.references)) ++ topOutput
    -    val neededFromNewJoin = newJoin.outputSet.filter(neededAttr.contains)
    -    val newPlan =
    -      if ((newJoin.outputSet -- neededFromNewJoin).nonEmpty) {
    -        Project(neededFromNewJoin.toSeq, newJoin)
    +    if (joinConds.isEmpty) {
    +      // Cartesian product is very expensive, so we exclude them from candidate plans.
    +      // This also significantly reduces the search space.
    --- End diff --
    
    great! now we can safely apply this optimization :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product results in ...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    please review @hvanhovell @cloud-fan @nsyca 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106603543
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala ---
    @@ -187,6 +220,8 @@ class JoinReorderSuite extends PlanTest with StatsEstimationTestBase {
           case (j1: Join, j2: Join) =>
             (sameJoinPlan(j1.left, j2.left) && sameJoinPlan(j1.right, j2.right)) ||
               (sameJoinPlan(j1.left, j2.right) && sameJoinPlan(j1.right, j2.left))
    +      case _ if plan1.children.nonEmpty && plan2.children.nonEmpty =>
    --- End diff --
    
    If they are two projects, we should recurse into it's child. And we may have some tests with other operators in the future, so I didn't only match project here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106582668
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -185,11 +184,14 @@ object JoinReorderDP extends PredicateHelper {
               // Should not join two overlapping item sets.
               if (oneSidePlan.itemIds.intersect(otherSidePlan.itemIds).isEmpty) {
                 val joinPlan = buildJoin(oneSidePlan, otherSidePlan, conf, conditions, topOutput)
    -            // Check if it's the first plan for the item set, or it's a better plan than
    -            // the existing one due to lower cost.
    -            val existingPlan = nextLevel.get(joinPlan.itemIds)
    -            if (existingPlan.isEmpty || joinPlan.cost.lessThan(existingPlan.get.cost)) {
    -              nextLevel.update(joinPlan.itemIds, joinPlan)
    +            if (joinPlan.isDefined) {
    --- End diff --
    
    when will this condition be false?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74656/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74528/testReport)** for PR 17286 at commit [`dfc34fc`](https://github.com/apache/spark/commit/dfc34fccae97fa09b382dd215d09173ebf70b2de).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by nsyca <gi...@git.apache.org>.

Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    @gatorsmile An equality join in most cases has a better filtering than an inequality join. This can be used heuristically. However, this is not always true. An equality join can be a lookup join from a fact table to a dimension table that it does not reduce anything but an inequality join such as a range join over time period. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74741/testReport)** for PR 17286 at commit [`aa75b6e`](https://github.com/apache/spark/commit/aa75b6edc0803b537c46a730a1cc9e2b8ac424a4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    @ioana-delaney Thanks for the replies.
    
    > Given A J1 B J2 C:   - case 1
    \u2022	level 0: (A), (B), (C)
    \u2022	level 1: {A, B}, ~~{A, C}~~, {B, C}
    \u2022	level 3: {A, B, C}
    Given A J1 B J2 C cross join D  - case 2
    \u2022	level 0: (A), (B), (C), (D)
    \u2022	level 1: {A, B}, ~~{A, C}, {A, D}~~, {B, C}, ~~{B, D}~~
    \u2022	level 3: {A, B, C}, ~~{A, B, D}, {A, C, D}, {B, C, D}~~
    \u2022	level 4: {A, B, C, D}
    
    Currently the algorithm can prune cartesian products in case 1. For case 2, now it is based on the output of `ReorderJoin`, cartesian products are put at the end. So we have the same effect on the final plan.
    
    In the future, we can merge these rules together.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106775334
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -19,31 +19,33 @@ package org.apache.spark.sql.catalyst.optimizer
     
     import scala.collection.mutable
     
    -import org.apache.spark.sql.catalyst.CatalystConf
     import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeSet, Expression, PredicateHelper}
     import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
     import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, LogicalPlan, Project}
     import org.apache.spark.sql.catalyst.rules.Rule
    +import org.apache.spark.sql.internal.SQLConf
     
     
     /**
      * Cost-based join reorder.
      * We may have several join reorder algorithms in the future. This class is the entry of these
      * algorithms, and chooses which one to use.
      */
    -case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper {
    +case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with PredicateHelper {
       def apply(plan: LogicalPlan): LogicalPlan = {
         if (!conf.cboEnabled || !conf.joinReorderEnabled) {
           plan
         } else {
    -      val result = plan transform {
    -        case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
    -          reorder(p, p.outputSet)
    -        case j @ Join(_, _, _: InnerLike, _) =>
    +      val result = plan transformDown {
    +        // Start reordering with a joinable item, which is an InnerLike join with conditions.
    +        case j @ Join(_, _, _: InnerLike, Some(cond)) =>
               reorder(j, j.outputSet)
    +        case p @ Project(projectList, Join(_, _, _: InnerLike, Some(cond)))
    +          if projectList.forall(_.isInstanceOf[Attribute]) =>
    +          reorder(p, p.outputSet)
           }
           // After reordering is finished, convert OrderedJoin back to Join
    -      result transform {
    +      result transformDown {
    --- End diff --
    
    Is the change needed? The order must be top-down?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74500 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74500/testReport)** for PR 17286 at commit [`f472c74`](https://github.com/apache/spark/commit/f472c74189f5f54c701876aaa73f7d331e5b814d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class JoinPlan(`
      * `case class Cost(rows: BigInt, size: BigInt) `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106583140
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -710,6 +710,14 @@ object SQLConf {
           .intConf
           .createWithDefault(12)
     
    +  val JOIN_REORDER_CARD_WEIGHT =
    +    buildConf("spark.sql.cbo.joinReorder.card.weight")
    +      .doc("The weight of cardinality (number of rows) for plan cost comparison in join reorder: " +
    +        "rows * weight + size * (1 - weight).")
    +      .doubleConf
    +      .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
    +      .createWithDefault(0.7)
    --- End diff --
    
    it is useful to expose this config? I think most of the users will just disable join reordering if they have problems.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by ioana-delaney <gi...@git.apache.org>.

Github user ioana-delaney commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    @gatorsmile Your example is correct. 
    
    Given A J1 B J2 C:
    \u2022	level 0: (A), (B), (C)
    \u2022	level 1: {A, B}, ~{A, C}~, {B, C}
    \u2022	level 3: {A, B, C}
    
    Given A J1 B J2 C cross join D 
    \u2022	level 0: (A), (B), (C), (D)
    \u2022	level 1: {A, B}, ~{A, C}~, ~{A, D}~, {B, C}, ~{B, D}~
    \u2022	level 3: {A, B, C}, ~{A, B, D}~, ~{A, C, D}~, ~{B, C, D}~
    \u2022	level 4: {A, B, C, D}
    
    The presence of join conditions between the outer-inner sets are good heuristics to be used during join enumeration.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106341181
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -696,6 +696,13 @@ object SQLConf {
           .intConf
           .createWithDefault(12)
     
    +  val JOIN_REORDER_CARD_WEIGHT =
    +    buildConf("spark.sql.cbo.joinReorder.card.weight")
    +      .doc("The weight of cardinality (number of rows) for plan cost comparison in join reorder: " +
    +        "rows * weight + size * (1 - weight).")
    +      .doubleConf
    +      .createWithDefault(0.7)
    --- End diff --
    
    What is boundary of this? adding `check`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product results in ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74500/testReport)** for PR 17286 at commit [`f472c74`](https://github.com/apache/spark/commit/f472c74189f5f54c701876aaa73f7d331e5b814d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Co-existing cross joins (join without a condition) and inner joins make the reordering procedure cumbersome. On one hand, putting cross join candidates into memo is not good in terms of search performance and memory consumption. On the other hand, adding cross joins after inner join exploration also has problems with multiple unjoinable groups, just putting them at the end of the plan is not right.
    
    So I decide to do reordering only for consecutive inner joins, which are separated by other plans (including cross joins). For bushy trees, this means we will reorder joins within each inner joinable "groups". For left deep trees, the `ReorderJoin` rule puts cross joins to the end, so all the previous inner joins will be reordered. Since join reorder is a complicated problem, we may need several levels of optimization, we can consider `ReorderJoin` as a heuristic algorithm.
    
    For common cases when there's no cross join, this algorithm behaves just like before.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74500/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106775016
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -19,31 +19,33 @@ package org.apache.spark.sql.catalyst.optimizer
     
     import scala.collection.mutable
     
    -import org.apache.spark.sql.catalyst.CatalystConf
     import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeSet, Expression, PredicateHelper}
     import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
     import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, LogicalPlan, Project}
     import org.apache.spark.sql.catalyst.rules.Rule
    +import org.apache.spark.sql.internal.SQLConf
     
     
     /**
      * Cost-based join reorder.
      * We may have several join reorder algorithms in the future. This class is the entry of these
      * algorithms, and chooses which one to use.
      */
    -case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper {
    +case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with PredicateHelper {
       def apply(plan: LogicalPlan): LogicalPlan = {
         if (!conf.cboEnabled || !conf.joinReorderEnabled) {
           plan
         } else {
    -      val result = plan transform {
    -        case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
    -          reorder(p, p.outputSet)
    -        case j @ Join(_, _, _: InnerLike, _) =>
    +      val result = plan transformDown {
    +        // Start reordering with a joinable item, which is an InnerLike join with conditions.
    +        case j @ Join(_, _, _: InnerLike, Some(cond)) =>
    --- End diff --
    
    I have a very general question about how to exclude cartesian product. See my old PR: #16762
    
    Having a join condition does not mean it is not a Cartesian product. : )
    
    Only having non-equal predicates in join conditions for an inner join. e.g., tab1.a < tab2.b


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    **[Test build #74656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74656/testReport)** for PR 17286 at commit [`237ad9d`](https://github.com/apache/spark/commit/237ad9d1575b3b08a60e6f0e0160ab9636b1453c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by ioana-delaney <gi...@git.apache.org>.

Github user ioana-delaney commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    @wzhfy Some thoughts on how to solve the Cartesian problem as part of the join enumeration algorithm is to apply a similar strategy to the one that we discuss for star-plans. You keep track of "connected" tables and "unconnected" tables. During join enumeration, mixed combinations are pruned until the plan for the \u201cconnected\u201d set of tables was built. Then, we add tables from the "unconnected" set - maybe only as left-deep trees (i.e. the size of the inner is one). Also, knowing that a set of tables are connected through join conditions, will allow further plan pruning based on the presence of join predicates. Integration with star-join, and probably other heuristics, would require to introduce some filtering/pruning strategies on top of the search engine. Just some thoughts\u2026 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by ioana-delaney <gi...@git.apache.org>.

Github user ioana-delaney commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    @wzhfy Given a set of input plans (either base table access or plans over derived/complex plans), one can build a graph based on the join conditions among the plans. I think join enumeration should be driven by the presence of join conditions between two plans. Following this approach, I think we can detect items that are not connected and plan them towards the end.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17286#discussion_r106583389
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala ---
    @@ -187,6 +220,8 @@ class JoinReorderSuite extends PlanTest with StatsEstimationTestBase {
           case (j1: Join, j2: Join) =>
             (sameJoinPlan(j1.left, j2.left) && sameJoinPlan(j1.right, j2.right)) ||
               (sameJoinPlan(j1.left, j2.right) && sameJoinPlan(j1.right, j2.left))
    +      case _ if plan1.children.nonEmpty && plan2.children.nonEmpty =>
    --- End diff --
    
    when will we hit this branch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74741/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17286
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org