Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/03/10 09:53:57 UTC

[GitHub] spark pull request #17240: [SPARK-17080][SQL][followup] simplify algorithm

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/17240

    [SPARK-17080][SQL][followup] simplify algorithm

    
    ## What changes were proposed in this pull request?
    
    1. Doing column pruning during reordering is troublesome. We can do it right after reordering instead; the logic for adding projects on intermediate joins can then be removed. This makes the code simpler and more reliable.
    2. Since cardinality is usually more important than size, we can simplify cost evaluation by using only cardinality (a sketch follows this list). Note that this also frees us from caring about column pruning during reordering (the first point); otherwise, projects would influence the output size of intermediate joins.
    3. Exclude cartesian products from the "memo". This significantly reduces the search space; otherwise every combination of items would exist in the memo. We can find the unjoinable items after reordering is finished and put them at the end.
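    A minimal sketch of the cardinality-only cost from point 2 (the `JoinPlan` fields here are simplified stand-ins, not the exact code in the patch):
    
    ```scala
    // Cost of joining two subplans: the sum of their accumulated costs plus
    // the cardinality of each side that is itself an intermediate join
    // (i.e. covers more than one base item). Leaf items add no cost.
    case class JoinPlan(itemIds: Set[Int], cost: BigInt, rowCount: BigInt)
    
    def combinedCost(one: JoinPlan, other: JoinPlan): BigInt = {
      one.cost + other.cost +
        (if (one.itemIds.size > 1) one.rowCount else BigInt(0)) +
        (if (other.itemIds.size > 1) other.rowCount else BigInt(0))
    }
    ```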
    
    ## How was this patch tested?
    
    Not related.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark joinReorder2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17240
    
----
commit 94bc9355b647f181549c8a11d77b00237ea4c528
Author: wangzhenhua <wa...@huawei.com>
Date:   2017-03-10T08:33:46Z

    1.don't deal with projects during reordering 2.exclude cartesian products during reordering 3.only use cardinality as cost.

----




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    @nsyca Thanks. I know there could be cases where size is also useful. However, usually big tables (fact tables) have more columns than small tables, so cardinality and size are positively correlated, i.e. a relation with larger cardinality also has a larger size.
    Again, I agree that in some cases this can be violated. But we also need to consider the implementation aspect: doing column pruning after reordering makes more sense and makes the code much more concise.




[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105815087
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -204,63 +206,37 @@ object JoinReorderDP extends PredicateHelper {
           oneJoinPlan: JoinPlan,
           otherJoinPlan: JoinPlan,
           conf: CatalystConf,
    -      conditions: Set[Expression],
    -      topOutput: AttributeSet): JoinPlan = {
    +      conditions: Set[Expression]): Option[JoinPlan] = {
     
         val onePlan = oneJoinPlan.plan
         val otherPlan = otherJoinPlan.plan
    -    // Now both onePlan and otherPlan become intermediate joins, so the cost of the
    -    // new join should also include their own cardinalities and sizes.
    -    val newCost = if (isCartesianProduct(onePlan) || isCartesianProduct(otherPlan)) {
    -      // We consider cartesian product very expensive, thus set a very large cost for it.
    -      // This enables to plan all the cartesian products at the end, because having a cartesian
    -      // product as an intermediate join will significantly increase a plan's cost, making it
    -      // impossible to be selected as the best plan for the items, unless there's no other choice.
    -      Cost(
    -        rows = BigInt(Long.MaxValue) * BigInt(Long.MaxValue),
    -        size = BigInt(Long.MaxValue) * BigInt(Long.MaxValue))
    -    } else {
    -      val onePlanStats = onePlan.stats(conf)
    -      val otherPlanStats = otherPlan.stats(conf)
    -      Cost(
    -        rows = oneJoinPlan.cost.rows + onePlanStats.rowCount.get +
    -          otherJoinPlan.cost.rows + otherPlanStats.rowCount.get,
    -        size = oneJoinPlan.cost.size + onePlanStats.sizeInBytes +
    -          otherJoinPlan.cost.size + otherPlanStats.sizeInBytes)
    -    }
    -
    -    // Put the deeper side on the left, tend to build a left-deep tree.
    -    val (left, right) = if (oneJoinPlan.itemIds.size >= otherJoinPlan.itemIds.size) {
    -      (onePlan, otherPlan)
    -    } else {
    -      (otherPlan, onePlan)
    -    }
         val joinConds = conditions
           .filterNot(l => canEvaluate(l, onePlan))
           .filterNot(r => canEvaluate(r, otherPlan))
           .filter(e => e.references.subsetOf(onePlan.outputSet ++ otherPlan.outputSet))
    -    // We use inner join whether join condition is empty or not. Since cross join is
    -    // equivalent to inner join without condition.
    -    val newJoin = Join(left, right, Inner, joinConds.reduceOption(And))
    -    val collectedJoinConds = joinConds ++ oneJoinPlan.joinConds ++ otherJoinPlan.joinConds
    -    val remainingConds = conditions -- collectedJoinConds
    -    val neededAttr = AttributeSet(remainingConds.flatMap(_.references)) ++ topOutput
    -    val neededFromNewJoin = newJoin.outputSet.filter(neededAttr.contains)
    -    val newPlan =
    -      if ((newJoin.outputSet -- neededFromNewJoin).nonEmpty) {
    -        Project(neededFromNewJoin.toSeq, newJoin)
    +    if (joinConds.isEmpty) {
    +      // Cartesian product is very expensive, so we exclude them from candidate plans.
    +      // This also helps us to reduce the search space. Unjoinable items will be put at the end
    +      // of the plan when the reordering phase finishes.
    +      None
    +    } else {
    +      // Put the deeper side on the left, tend to build a left-deep tree.
    +      val (left, right) = if (oneJoinPlan.itemIds.size >= otherJoinPlan.itemIds.size) {
    +        (onePlan, otherPlan)
           } else {
    -        newJoin
    +        (otherPlan, onePlan)
           }
    +      val newJoin = Join(left, right, Inner, joinConds.reduceOption(And))
    +      val itemIds = oneJoinPlan.itemIds.union(otherJoinPlan.itemIds)
     
    -    val itemIds = oneJoinPlan.itemIds.union(otherJoinPlan.itemIds)
    -    JoinPlan(itemIds, newPlan, collectedJoinConds, newCost)
    -  }
    +      // Now onePlan/otherPlan becomes an intermediate join (if it's a non-leaf item),
    +      // so the cost of the new join should also include their own cardinalities.
    +      val newCost = oneJoinPlan.cost + otherJoinPlan.cost +
    +        (if (oneJoinPlan.itemIds.size > 1) onePlan.stats(conf).rowCount.get else 0) +
    +        (if (otherJoinPlan.itemIds.size > 1) otherPlan.stats(conf).rowCount.get else 0)
    --- End diff --
    
    The filtering factor is considered in `def stats`.
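    A hedged illustration of that point: a `Filter` node's statistics estimate already scales the child's row count by a selectivity factor, so the `rowCount` read during reordering reflects local predicates. The numbers below are made-up stand-ins, not Spark's actual estimator:
    
    ```scala
    // Assumed for illustration: t2 has 1,000,000 rows and the estimator
    // assigns selectivity 0.01 to a predicate like t2.c2 = 1; the stats the
    // reorderer sees for the filtered t2 are then already reduced.
    val childRows    = BigInt(1000000)
    val selectivity  = BigDecimal("0.01")
    val filteredRows = (BigDecimal(childRows) * selectivity).toBigInt // 10000
    ```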




[GitHub] spark pull request #17240: [SPARK-17080][SQL][followup] simplify algorithm

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105368297
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -87,8 +88,8 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
           val replacedLeft = replaceWithOrderedJoin(left)
           val replacedRight = replaceWithOrderedJoin(right)
           OrderedJoin(j.copy(left = replacedLeft, right = replacedRight))
    -    case p @ Project(_, join) =>
    -      p.copy(child = replaceWithOrderedJoin(join))
    +    case p @ Project(_, j @ Join(_, _, _, _)) =>
    --- End diff --
    
    NIT: `case Project(projectList, j: Join) =>`




[GitHub] spark issue #17240: [SPARK-17080][SQL][followup] simplify algorithm

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    **[Test build #74314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74314/testReport)** for PR 17240 at commit [`94bc935`](https://github.com/apache/spark/commit/94bc9355b647f181549c8a11d77b00237ea4c528).




[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105822819
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -122,46 +119,48 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
      * level 3: p({A, B, C, D})
      * where p({A, B, C, D}) is the final output plan.
      *
    - * For cost evaluation, since physical costs for operators are not available currently, we use
    - * cardinalities and sizes to compute costs.
    + * To evaluate cost for a given plan, we calculate the sum of cardinalities for all intermediate
    + * joins in the plan.
      */
     object JoinReorderDP extends PredicateHelper {
     
       def search(
           conf: CatalystConf,
           items: Seq[LogicalPlan],
    -      conditions: Set[Expression],
    -      topOutput: AttributeSet): Option[LogicalPlan] = {
    +      conditions: Set[Expression]): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, cost = 0)
    +    })
     
    -    for (lev <- 1 until items.length) {
    +    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
           // Build plans for the next level.
    -      foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
    +      foundPlans += searchLevel(foundPlans, conf, conditions)
         }
     
    -    val plansLastLevel = foundPlans(items.length - 1)
    -    if (plansLastLevel.isEmpty) {
    -      // Failed to find a plan, fall back to the original plan
    -      None
    -    } else {
    -      // There must be only one plan at the last level, which contains all items.
    -      assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == items.length)
    -      Some(plansLastLevel.head._2.plan)
    +    // Find the best plan
    +    assert(foundPlans.last.size <= 1)
    +    val bestJoinPlan = foundPlans.last.headOption
    --- End diff --
    
    what if the last level has 0 entries but the previous level has some?




[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105822517
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -122,46 +119,48 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
      * level 3: p({A, B, C, D})
      * where p({A, B, C, D}) is the final output plan.
      *
    - * For cost evaluation, since physical costs for operators are not available currently, we use
    - * cardinalities and sizes to compute costs.
    + * To evaluate cost for a given plan, we calculate the sum of cardinalities for all intermediate
    + * joins in the plan.
      */
     object JoinReorderDP extends PredicateHelper {
     
       def search(
           conf: CatalystConf,
           items: Seq[LogicalPlan],
    -      conditions: Set[Expression],
    -      topOutput: AttributeSet): Option[LogicalPlan] = {
    +      conditions: Set[Expression]): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, cost = 0)
    +    })
     
    -    for (lev <- 1 until items.length) {
    +    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
    --- End diff --
    
    add some comments to explain why we can stop when the last level has at most one entry.




[GitHub] spark pull request #17240: [SPARK-17080][SQL][followup] simplify algorithm

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105368556
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -122,46 +123,59 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
      * level 3: p({A, B, C, D})
      * where p({A, B, C, D}) is the final output plan.
      *
    - * For cost evaluation, since physical costs for operators are not available currently, we use
    - * cardinalities and sizes to compute costs.
    + * To evaluate cost for a given plan, we calculate the sum of cardinalities for all intermediate
    + * joins in the plan.
      */
     object JoinReorderDP extends PredicateHelper {
     
       def search(
           conf: CatalystConf,
           items: Seq[LogicalPlan],
    -      conditions: Set[Expression],
    -      topOutput: AttributeSet): Option[LogicalPlan] = {
    +      conditions: Set[Expression]): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(e => (e._2, e._1)).toMap
    --- End diff --
    
    NIT: use swap: `items.zipWithIndex.map(_.swap).toMap`




[GitHub] spark issue #17240: [SPARK-17080][SQL][followup] simplify algorithm

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    This looks like a lot of change for an addendum -- should this not have gone into the original change? The title isn't descriptive here.




[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy closed the pull request at:

    https://github.com/apache/spark/pull/17240




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #17240: [SPARK-17080][SQL][followup] simplify algorithm

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105368212
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -75,8 +75,9 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
             val (rightPlans, rightConditions) = extractInnerJoins(right)
             (leftPlans ++ rightPlans, cond.toSet.flatMap(splitConjunctivePredicates) ++
               leftConditions ++ rightConditions)
    -      case Project(projectList, join) if projectList.forall(_.isInstanceOf[Attribute]) =>
    -        extractInnerJoins(join)
    +      case Project(projectList, j @ Join(_, _, _, _))
    --- End diff --
    
    NIT: `case Project(projectList, j: Join) if projectList.forall(_.isInstanceOf[Attribute]) =>`




[GitHub] spark pull request #17240: [SPARK-17080][SQL][followup] 1.Postpone column pr...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105371649
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -122,46 +123,59 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
      * level 3: p({A, B, C, D})
      * where p({A, B, C, D}) is the final output plan.
      *
    - * For cost evaluation, since physical costs for operators are not available currently, we use
    - * cardinalities and sizes to compute costs.
    + * To evaluate cost for a given plan, we calculate the sum of cardinalities for all intermediate
    + * joins in the plan.
      */
     object JoinReorderDP extends PredicateHelper {
     
       def search(
           conf: CatalystConf,
           items: Seq[LogicalPlan],
    -      conditions: Set[Expression],
    -      topOutput: AttributeSet): Option[LogicalPlan] = {
    +      conditions: Set[Expression]): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(e => (e._2, e._1)).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, cost = 0)
    +    })
     
         for (lev <- 1 until items.length) {
           // Build plans for the next level.
    -      foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
    +      foundPlans += searchLevel(foundPlans, conf, conditions)
         }
     
    -    val plansLastLevel = foundPlans(items.length - 1)
    -    if (plansLastLevel.isEmpty) {
    -      // Failed to find a plan, fall back to the original plan
    -      None
    -    } else {
    -      // There must be only one plan at the last level, which contains all items.
    -      assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == items.length)
    -      Some(plansLastLevel.head._2.plan)
    +    // Find the best plan
    +    var level = items.length - 1
    +    var found = false
    +    var bestJoinPlan: Option[JoinPlan] = None
    +    while (!found && level >= 0) {
    --- End diff --
    
    It would be better to do this when you are building the plan, e.g.:
    ```scala
    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
      foundPlans += searchLevel(foundPlans, conf, conditions)
    }
    assert(foundPlans.last.size <= 1)
    val bestJoinPlan = foundPlans.last.headOption
    ```




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    **[Test build #74368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74368/testReport)** for PR 17240 at commit [`82a1740`](https://github.com/apache/spark/commit/82a17405a9abc0e5690e46855011e3755c2d8f90).




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by nsyca <gi...@git.apache.org>.
Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    @wzhfy wrote:
    *"usually big tables (fact table) have more columns than small tables, so cardinality and size is positively correlated"*
    
    I am aware this may not be the forum to discuss design issues, so I will make it brief. In general, fact tables contain the columns associated with the primary key columns of the dimension tables plus a few columns (usually numbers), and the dimension tables contain the descriptions of the entities they represent. An example is the STORE_SALES and STORE tables in the TPC-DS schema.




[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r106772853
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -36,27 +36,24 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
         if (!conf.cboEnabled || !conf.joinReorderEnabled) {
           plan
         } else {
    -      val result = plan transform {
    -        case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
    -          reorder(p, p.outputSet)
    -        case j @ Join(_, _, _: InnerLike, _) =>
    -          reorder(j, j.outputSet)
    +      val result = plan transformDown {
    +        case j @ Join(_, _, _: InnerLike, _) => reorder(j)
           }
           // After reordering is finished, convert OrderedJoin back to Join
    -      result transform {
    +      result transformDown {
             case oj: OrderedJoin => oj.join
           }
         }
       }
     
    -  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
    +  def reorder(plan: LogicalPlan): LogicalPlan = {
    --- End diff --
    
    private




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    Since [the other PR](https://github.com/apache/spark/pull/17286) for the same JIRA has been merged, I'm closing this. Thanks for the reviews and discussions! If you have any more comments, I'll fix them in a follow-up.




[GitHub] spark issue #17240: [SPARK-17080][SQL][followup] 1.Postpone column pruning a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74314/
    Test PASSed.




[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

Posted by nsyca <gi...@git.apache.org>.
Github user nsyca commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105748347
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -204,63 +206,37 @@ object JoinReorderDP extends PredicateHelper {
           oneJoinPlan: JoinPlan,
           otherJoinPlan: JoinPlan,
           conf: CatalystConf,
    -      conditions: Set[Expression],
    -      topOutput: AttributeSet): JoinPlan = {
    +      conditions: Set[Expression]): Option[JoinPlan] = {
     
         val onePlan = oneJoinPlan.plan
         val otherPlan = otherJoinPlan.plan
    -    // Now both onePlan and otherPlan become intermediate joins, so the cost of the
    -    // new join should also include their own cardinalities and sizes.
    -    val newCost = if (isCartesianProduct(onePlan) || isCartesianProduct(otherPlan)) {
    -      // We consider cartesian product very expensive, thus set a very large cost for it.
    -      // This enables to plan all the cartesian products at the end, because having a cartesian
    -      // product as an intermediate join will significantly increase a plan's cost, making it
    -      // impossible to be selected as the best plan for the items, unless there's no other choice.
    -      Cost(
    -        rows = BigInt(Long.MaxValue) * BigInt(Long.MaxValue),
    -        size = BigInt(Long.MaxValue) * BigInt(Long.MaxValue))
    -    } else {
    -      val onePlanStats = onePlan.stats(conf)
    -      val otherPlanStats = otherPlan.stats(conf)
    -      Cost(
    -        rows = oneJoinPlan.cost.rows + onePlanStats.rowCount.get +
    -          otherJoinPlan.cost.rows + otherPlanStats.rowCount.get,
    -        size = oneJoinPlan.cost.size + onePlanStats.sizeInBytes +
    -          otherJoinPlan.cost.size + otherPlanStats.sizeInBytes)
    -    }
    -
    -    // Put the deeper side on the left, tend to build a left-deep tree.
    -    val (left, right) = if (oneJoinPlan.itemIds.size >= otherJoinPlan.itemIds.size) {
    -      (onePlan, otherPlan)
    -    } else {
    -      (otherPlan, onePlan)
    -    }
         val joinConds = conditions
           .filterNot(l => canEvaluate(l, onePlan))
           .filterNot(r => canEvaluate(r, otherPlan))
           .filter(e => e.references.subsetOf(onePlan.outputSet ++ otherPlan.outputSet))
    -    // We use inner join whether join condition is empty or not. Since cross join is
    -    // equivalent to inner join without condition.
    -    val newJoin = Join(left, right, Inner, joinConds.reduceOption(And))
    -    val collectedJoinConds = joinConds ++ oneJoinPlan.joinConds ++ otherJoinPlan.joinConds
    -    val remainingConds = conditions -- collectedJoinConds
    -    val neededAttr = AttributeSet(remainingConds.flatMap(_.references)) ++ topOutput
    -    val neededFromNewJoin = newJoin.outputSet.filter(neededAttr.contains)
    -    val newPlan =
    -      if ((newJoin.outputSet -- neededFromNewJoin).nonEmpty) {
    -        Project(neededFromNewJoin.toSeq, newJoin)
    +    if (joinConds.isEmpty) {
    +      // Cartesian product is very expensive, so we exclude them from candidate plans.
    +      // This also helps us to reduce the search space. Unjoinable items will be put at the end
    +      // of the plan when the reordering phase finishes.
    +      None
    +    } else {
    +      // Put the deeper side on the left, tend to build a left-deep tree.
    +      val (left, right) = if (oneJoinPlan.itemIds.size >= otherJoinPlan.itemIds.size) {
    +        (onePlan, otherPlan)
           } else {
    -        newJoin
    +        (otherPlan, onePlan)
           }
    +      val newJoin = Join(left, right, Inner, joinConds.reduceOption(And))
    +      val itemIds = oneJoinPlan.itemIds.union(otherJoinPlan.itemIds)
     
    -    val itemIds = oneJoinPlan.itemIds.union(otherJoinPlan.itemIds)
    -    JoinPlan(itemIds, newPlan, collectedJoinConds, newCost)
    -  }
    +      // Now onePlan/otherPlan becomes an intermediate join (if it's a non-leaf item),
    +      // so the cost of the new join should also include their own cardinalities.
    +      val newCost = oneJoinPlan.cost + otherJoinPlan.cost +
    +        (if (oneJoinPlan.itemIds.size > 1) onePlan.stats(conf).rowCount.get else 0) +
    +        (if (otherJoinPlan.itemIds.size > 1) otherPlan.stats(conf).rowCount.get else 0)
    --- End diff --
    
    This may not be a problem with this PR, but since the rowCount does not take the filtering of any local predicates into account, it will make the outcome of this join reordering less reliable.
    
    For example:
    `select * from t1, t2, t3 where t1.c1 = t2.c1 and t1.c2 = t3.c2`
    and
    `select * from t1, t2, t3 where t1.c1 = t2.c1 and t1.c2 = t3.c2 and t2.c2=1`
    
    The rowCount from relation T2 passed to this join reordering code is the same in both cases. With the presence of the predicate `t2.c2=1`, the cardinality of the input to the join from T2 could be substantially smaller.





[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105821744
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -87,8 +84,8 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
           val replacedLeft = replaceWithOrderedJoin(left)
           val replacedRight = replaceWithOrderedJoin(right)
           OrderedJoin(j.copy(left = replacedLeft, right = replacedRight))
    -    case p @ Project(_, join) =>
    -      p.copy(child = replaceWithOrderedJoin(join))
    +    case p @ Project(projectList, j: Join) =>
    --- End diff --
    
    now the result of join reordering won't have a Project, right?




[GitHub] spark issue #17240: [SPARK-17080][SQL][followup] simplify algorithm

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    also cc @cloud-fan 




[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105823209
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -122,46 +119,48 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
      * level 3: p({A, B, C, D})
      * where p({A, B, C, D}) is the final output plan.
      *
    - * For cost evaluation, since physical costs for operators are not available currently, we use
    - * cardinalities and sizes to compute costs.
    + * To evaluate cost for a given plan, we calculate the sum of cardinalities for all intermediate
    + * joins in the plan.
      */
     object JoinReorderDP extends PredicateHelper {
     
       def search(
           conf: CatalystConf,
           items: Seq[LogicalPlan],
    -      conditions: Set[Expression],
    -      topOutput: AttributeSet): Option[LogicalPlan] = {
    +      conditions: Set[Expression]): Option[LogicalPlan] = {
     
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
         val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
    -      case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 0))
    -    }.toMap)
    +      case (id, item) => Set(id) -> JoinPlan(Set(id), item, cost = 0)
    +    })
     
    -    for (lev <- 1 until items.length) {
    +    while (foundPlans.size < items.length && foundPlans.last.size > 1) {
           // Build plans for the next level.
    -      foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
    +      foundPlans += searchLevel(foundPlans, conf, conditions)
         }
     
    -    val plansLastLevel = foundPlans(items.length - 1)
    -    if (plansLastLevel.isEmpty) {
    -      // Failed to find a plan, fall back to the original plan
    -      None
    -    } else {
    -      // There must be only one plan at the last level, which contains all items.
    -      assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == items.length)
    -      Some(plansLastLevel.head._2.plan)
    +    // Find the best plan
    +    assert(foundPlans.last.size <= 1)
    +    val bestJoinPlan = foundPlans.last.headOption
    --- End diff --
    
    and what if the last level has more than one entry? Shall we pick the best among them?




[GitHub] spark issue #17240: [SPARK-17080][SQL][followup] 1.Postpone column pruning a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17240: [SPARK-17080][SQL][followup] 1.Postpone column pruning a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    **[Test build #74314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74314/testReport)** for PR 17240 at commit [`94bc935`](https://github.com/apache/spark/commit/94bc9355b647f181549c8a11d77b00237ea4c528).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class JoinPlan(itemIds: Set[Int], plan: LogicalPlan, cost: BigInt)`




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74368/
    Test PASSed.




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    **[Test build #74368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74368/testReport)** for PR 17240 at commit [`82a1740`](https://github.com/apache/spark/commit/82a17405a9abc0e5690e46855011e3755c2d8f90).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #17240: [SPARK-17080][SQL][followup] simplify algorithm

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17240#discussion_r105368096
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -36,27 +36,27 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
         if (!conf.cboEnabled || !conf.joinReorderEnabled) {
           plan
         } else {
    -      val result = plan transform {
    -        case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
    -          reorder(p, p.outputSet)
    -        case j @ Join(_, _, _: InnerLike, _) =>
    -          reorder(j, j.outputSet)
    +      val ordered = plan transformDown {
    +        case j @ Join(_, _, _: InnerLike, _) => reorder(j)
           }
           // After reordering is finished, convert OrderedJoin back to Join
    -      result transform {
    +      val result = ordered transformDown {
             case oj: OrderedJoin => oj.join
           }
    +      // Since we don't care about projected attributes during join reordering, we need to prune
    +      // columns here.
    +      ColumnPruning(result)
    --- End diff --
    
    Just place column pruning in the same batch as `CostBasedJoinReorder`.
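    A sketch of that suggestion in the optimizer's batch list (the batch name and the `Once` strategy are illustrative assumptions, not the actual optimizer configuration):
    
    ```scala
    // Running ColumnPruning right after the reorder rule in the same batch
    // removes the need to prune columns inside the reordering algorithm.
    Batch("Join Reorder", Once,
      CostBasedJoinReorder(conf),
      ColumnPruning)
    ```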




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by nsyca <gi...@git.apache.org>.
Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    In the PR description:
    *"Usually cardinality is more important than size, we can simplify cost evaluation by using only cardinality. Note that this also enables us to not care about column pru(n)ing during reordering. Because otherwise, project will influence the output size of intermediate joins."*
    
    I do not quite agree with this statement. If we have two candidates in the join reordering
    
    T1 join T2_100_columns T2 on p(T1,T2) join T3_1_column T3 on p(T1,T3) ... [1]
    and
    T1 join T3_1_column T3 on p(T1,T3) join T2_100_columns T2 on p(T1,T2) ... [2]
    
    assuming all columns are of equal width (say, all integers), T2_100_columns and T3_1_column carry 100 columns and 1 column, respectively, along to the input of the next operator above the 3-table join, and the estimated cardinalities of the first join in [1] and [2] are 10 and 20 rows respectively. Without taking the average row widths of the tables into account, the join reordering algorithm will favour [1] over [2] for the first join, which carries the payload of the extra 100 columns from T2 into the second join.
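    A back-of-envelope check of this example, assuming 4-byte integer columns and counting only the payload carried from T2/T3 into the second join (everything except the stated 10- and 20-row cardinalities is illustrative):
    
    ```scala
    val intWidth     = 4
    val plan1Payload = 10 * 100 * intWidth // [1]: 10 rows x T2's 100 columns = 4000 bytes
    val plan2Payload = 20 * 1 * intWidth   // [2]: 20 rows x T3's 1 column   = 80 bytes
    // Cardinality alone prefers [1], but [2] feeds far fewer bytes into the second join.
    ```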




[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17240
  
    @hvanhovell why simplify the cost function:
    1. By removing size from the cost, we don't have to care about column pruning while building new join nodes; otherwise column pruning would influence the size of intermediate output and thus the cost of a plan. This makes the code much more concise. And I think ordering joins first and then pruning columns is easier and more natural.
    2. Cardinality is often much more important than size in join reordering. BTW, PostgreSQL also uses cardinality (and physical costs) in join reordering.

