You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/05/09 08:39:20 UTC

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/17918

    [SPARK-20678][SQL] Ndv for columns not in filter condition should also be updated

    ## What changes were proposed in this pull request?
    
    In filter estimation, we update column stats for those columns in filter condition. However, if the number of rows decreases after the filter (i.e. the overall selectivity is less than 1), we need to update (scale down) the number of distinct values (NDV) for all columns, no matter they are in filter conditions or not.
    
    This pr also fixes the inconsistency of rounding mode for ndv and rowCount.
    
    ## How was this patch tested?
    
    Added new tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark scaleDownNdvAfterFilter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17918.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17918
    
----
commit 39b72caff37dc3da1e50beccfbe58f56cedf8007
Author: wangzhenhua <wa...@huawei.com>
Date:   2017-05-09T08:25:37Z

    update ndv for all columns

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    I agree that we need to scale down NDV for all the referenced columns in a query if a filter condition reduces the number of qualified rows.  Do you find this problem when running tpc-ds benchmark?  If so, what queries encounters this issue?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    @ron8hu You can look at tpcds q18.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115664061
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -217,32 +217,18 @@ case class InnerOuterEstimation(conf: SQLConf, join: Join) extends Logging {
           if (joinKeyStats.contains(a)) {
             outputAttrStats += a -> joinKeyStats(a)
           } else {
    -        val leftRatio = if (leftRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(leftRows)
    -        } else {
    -          BigDecimal(0)
    -        }
    -        val rightRatio = if (rightRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(rightRows)
    -        } else {
    -          BigDecimal(0)
    -        }
             val oldColStat = oldAttrStats(a)
             val oldNdv = oldColStat.distinctCount
    -        // We only change (scale down) the number of distinct values if the number of rows
    -        // decreases after join, because join won't produce new values even if the number of
    -        // rows increases.
    -        val newNdv = if (join.left.outputSet.contains(a) && leftRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * leftRatio)
    -        } else if (join.right.outputSet.contains(a) && rightRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * rightRatio)
    +        val newNdv = if (join.left.outputSet.contains(a)) {
    +          updateNdv(oldNumRows = leftRows, newNumRows = outputRows, oldNdv = oldNdv)
             } else {
    -          oldNdv
    +          updateNdv(oldNumRows = rightRows, newNumRows = outputRows, oldNdv = oldNdv)
             }
    +        val newColStat =
    --- End diff --
    
    Yes, please look at [this line](https://github.com/apache/spark/pull/17918/files#diff-e068b2e4d8b82a9587450cd17d8d7226R791). I don't know why it is not folded.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115672685
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -421,7 +420,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       test("cdouble < 3.0") {
         validateEstimatedStats(
           Filter(LessThan(attrDouble, Literal(3.0)), childStatsTestPlan(Seq(attrDouble), 10L)),
    -      Seq(attrDouble -> ColumnStat(distinctCount = 2, min = Some(1.0), max = Some(3.0),
    +      Seq(attrDouble -> ColumnStat(distinctCount = 3, min = Some(1.0), max = Some(3.0),
    --- End diff --
    
    No, it's because previously ndv is computed using HALF_UP rounding mode, while row count is computed using CEILING mode. I just make them consistent (both using CEILING), as in PR description.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    **[Test build #76664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76664/testReport)** for PR 17918 at commit [`39b72ca`](https://github.com/apache/spark/commit/39b72caff37dc3da1e50beccfbe58f56cedf8007).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115478819
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -217,32 +217,18 @@ case class InnerOuterEstimation(conf: SQLConf, join: Join) extends Logging {
           if (joinKeyStats.contains(a)) {
             outputAttrStats += a -> joinKeyStats(a)
           } else {
    -        val leftRatio = if (leftRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(leftRows)
    -        } else {
    -          BigDecimal(0)
    -        }
    -        val rightRatio = if (rightRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(rightRows)
    -        } else {
    -          BigDecimal(0)
    -        }
             val oldColStat = oldAttrStats(a)
             val oldNdv = oldColStat.distinctCount
    -        // We only change (scale down) the number of distinct values if the number of rows
    -        // decreases after join, because join won't produce new values even if the number of
    -        // rows increases.
    -        val newNdv = if (join.left.outputSet.contains(a) && leftRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * leftRatio)
    -        } else if (join.right.outputSet.contains(a) && rightRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * rightRatio)
    +        val newNdv = if (join.left.outputSet.contains(a)) {
    +          updateNdv(oldNumRows = leftRows, newNumRows = outputRows, oldNdv = oldNdv)
             } else {
    -          oldNdv
    +          updateNdv(oldNumRows = rightRows, newNumRows = outputRows, oldNdv = oldNdv)
             }
    +        val newColStat =
    --- End diff --
    
    nit: `val newColStat = oldColStat.copy(distinctCount = newNdv)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76664/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115478033
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -750,24 +739,59 @@ case class FilterEstimation(plan: Filter, catalystConf: SQLConf) extends Logging
           }
         }
     
    -    Some(percent.toDouble)
    +    Some(percent)
       }
     
     }
     
    -class ColumnStatsMap {
    -  private val baseMap: mutable.Map[ExprId, (Attribute, ColumnStat)] = mutable.HashMap.empty
    +/**
    + * This class contains the original column stats from child, and maintains the updated column stats.
    + * We will update the corresponding ColumnStats for a column after we apply a predicate condition.
    + * For example, column c has [min, max] value as [0, 100].  In a range condition such as
    + * (c > 40 AND c <= 50), we need to set the column's [min, max] value to [40, 100] after we
    + * evaluate the first condition c > 40. We also need to set the column's [min, max] value to
    + * [40, 50] after we evaluate the second condition c <= 50.
    + *
    + * @param originalMap Original column stats from child.
    + */
    +case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
     
    -  def setInitValues(colStats: AttributeMap[ColumnStat]): Unit = {
    -    baseMap.clear()
    -    baseMap ++= colStats.baseMap
    -  }
    +  /** This map maintains the latest column stats. */
    +  private val updatedMap: mutable.Map[ExprId, (Attribute, ColumnStat)] = mutable.HashMap.empty
     
    -  def contains(a: Attribute): Boolean = baseMap.contains(a.exprId)
    +  def contains(a: Attribute): Boolean = updatedMap.contains(a.exprId) || originalMap.contains(a)
     
    -  def apply(a: Attribute): ColumnStat = baseMap(a.exprId)._2
    +  /**
    +   * Gets column stat for the given attribute. Prefer the column stat in updatedMap than that in
    +   * originalMap, because updatedMap has the latest (updated) column stats.
    +   */
    +  def apply(a: Attribute): ColumnStat = {
    +    if (updatedMap.contains(a.exprId)) {
    +      updatedMap(a.exprId)._2
    +    } else {
    +      originalMap(a)
    +    }
    +  }
     
    -  def update(a: Attribute, stats: ColumnStat): Unit = baseMap.update(a.exprId, a -> stats)
    +  /** Updates column stats in updatedMap. */
    +  def update(a: Attribute, stats: ColumnStat): Unit = updatedMap.update(a.exprId, a -> stats)
     
    -  def toColumnStats: AttributeMap[ColumnStat] = AttributeMap(baseMap.values.toSeq)
    +  /**
    +   * Collects updated column stats, and scales down ndv for other column stats if the number of rows
    +   * decreases after this Filter operator.
    +   */
    +  def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt)
    +    : AttributeMap[ColumnStat] = {
    +    val newColumnStats = originalMap.map { case (attr, oriColStat) =>
    +      // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows
    +      // decreases; otherwise keep it unchanged.
    +      val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter,
    +        newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount)
    +      val colStat = if (updatedMap.contains(attr.exprId)) updatedMap(attr.exprId)._2 else oriColStat
    --- End diff --
    
    ```
    val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat)
    attr -> colStat.copy(distinctCount = newNdv)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115663707
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -217,32 +217,18 @@ case class InnerOuterEstimation(conf: SQLConf, join: Join) extends Logging {
           if (joinKeyStats.contains(a)) {
             outputAttrStats += a -> joinKeyStats(a)
           } else {
    -        val leftRatio = if (leftRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(leftRows)
    -        } else {
    -          BigDecimal(0)
    -        }
    -        val rightRatio = if (rightRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(rightRows)
    -        } else {
    -          BigDecimal(0)
    -        }
             val oldColStat = oldAttrStats(a)
             val oldNdv = oldColStat.distinctCount
    -        // We only change (scale down) the number of distinct values if the number of rows
    -        // decreases after join, because join won't produce new values even if the number of
    -        // rows increases.
    -        val newNdv = if (join.left.outputSet.contains(a) && leftRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * leftRatio)
    -        } else if (join.right.outputSet.contains(a) && rightRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * rightRatio)
    +        val newNdv = if (join.left.outputSet.contains(a)) {
    +          updateNdv(oldNumRows = leftRows, newNumRows = outputRows, oldNdv = oldNdv)
             } else {
    -          oldNdv
    +          updateNdv(oldNumRows = rightRows, newNumRows = outputRows, oldNdv = oldNdv)
             }
    +        val newColStat =
    --- End diff --
    
    is it fixed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115478974
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -217,32 +217,18 @@ case class InnerOuterEstimation(conf: SQLConf, join: Join) extends Logging {
           if (joinKeyStats.contains(a)) {
             outputAttrStats += a -> joinKeyStats(a)
           } else {
    -        val leftRatio = if (leftRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(leftRows)
    -        } else {
    -          BigDecimal(0)
    -        }
    -        val rightRatio = if (rightRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(rightRows)
    -        } else {
    -          BigDecimal(0)
    -        }
             val oldColStat = oldAttrStats(a)
             val oldNdv = oldColStat.distinctCount
    -        // We only change (scale down) the number of distinct values if the number of rows
    -        // decreases after join, because join won't produce new values even if the number of
    -        // rows increases.
    -        val newNdv = if (join.left.outputSet.contains(a) && leftRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * leftRatio)
    -        } else if (join.right.outputSet.contains(a) && rightRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * rightRatio)
    +        val newNdv = if (join.left.outputSet.contains(a)) {
    +          updateNdv(oldNumRows = leftRows, newNumRows = outputRows, oldNdv = oldNdv)
             } else {
    -          oldNdv
    +          updateNdv(oldNumRows = rightRows, newNumRows = outputRows, oldNdv = oldNdv)
             }
    +        val newColStat =
    --- End diff --
    
    I don't think adding an extra `if` to avoid unnecessary copy has performance benefits.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115671957
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -421,7 +420,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       test("cdouble < 3.0") {
         validateEstimatedStats(
           Filter(LessThan(attrDouble, Literal(3.0)), childStatsTestPlan(Seq(attrDouble), 10L)),
    -      Seq(attrDouble -> ColumnStat(distinctCount = 2, min = Some(1.0), max = Some(3.0),
    +      Seq(attrDouble -> ColumnStat(distinctCount = 3, min = Some(1.0), max = Some(3.0),
    --- End diff --
    
    these changes are because the filter selectivity is now decimal instead of double?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115664182
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -217,32 +217,18 @@ case class InnerOuterEstimation(conf: SQLConf, join: Join) extends Logging {
           if (joinKeyStats.contains(a)) {
             outputAttrStats += a -> joinKeyStats(a)
           } else {
    -        val leftRatio = if (leftRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(leftRows)
    -        } else {
    -          BigDecimal(0)
    -        }
    -        val rightRatio = if (rightRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(rightRows)
    -        } else {
    -          BigDecimal(0)
    -        }
             val oldColStat = oldAttrStats(a)
             val oldNdv = oldColStat.distinctCount
    -        // We only change (scale down) the number of distinct values if the number of rows
    -        // decreases after join, because join won't produce new values even if the number of
    -        // rows increases.
    -        val newNdv = if (join.left.outputSet.contains(a) && leftRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * leftRatio)
    -        } else if (join.right.outputSet.contains(a) && rightRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * rightRatio)
    +        val newNdv = if (join.left.outputSet.contains(a)) {
    +          updateNdv(oldNumRows = leftRows, newNumRows = outputRows, oldNdv = oldNdv)
             } else {
    -          oldNdv
    +          updateNdv(oldNumRows = rightRows, newNumRows = outputRows, oldNdv = oldNdv)
             }
    +        val newColStat =
    --- End diff --
    
    Oh, I looked at the wrong file. I'll fix it now


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    cc @cloud-fan  @gatorsmile @ron8hu 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17918


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17918: [SPARK-20678][SQL] Ndv for columns not in filter ...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17918#discussion_r115489131
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -217,32 +217,18 @@ case class InnerOuterEstimation(conf: SQLConf, join: Join) extends Logging {
           if (joinKeyStats.contains(a)) {
             outputAttrStats += a -> joinKeyStats(a)
           } else {
    -        val leftRatio = if (leftRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(leftRows)
    -        } else {
    -          BigDecimal(0)
    -        }
    -        val rightRatio = if (rightRows != 0) {
    -          BigDecimal(outputRows) / BigDecimal(rightRows)
    -        } else {
    -          BigDecimal(0)
    -        }
             val oldColStat = oldAttrStats(a)
             val oldNdv = oldColStat.distinctCount
    -        // We only change (scale down) the number of distinct values if the number of rows
    -        // decreases after join, because join won't produce new values even if the number of
    -        // rows increases.
    -        val newNdv = if (join.left.outputSet.contains(a) && leftRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * leftRatio)
    -        } else if (join.right.outputSet.contains(a) && rightRatio < 1) {
    -          ceil(BigDecimal(oldNdv) * rightRatio)
    +        val newNdv = if (join.left.outputSet.contains(a)) {
    +          updateNdv(oldNumRows = leftRows, newNumRows = outputRows, oldNdv = oldNdv)
             } else {
    -          oldNdv
    +          updateNdv(oldNumRows = rightRows, newNumRows = outputRows, oldNdv = oldNdv)
             }
    +        val newColStat =
    --- End diff --
    
    OK, fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76683/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    **[Test build #76683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76683/testReport)** for PR 17918 at commit [`2289422`](https://github.com/apache/spark/commit/2289422d20fe4d8012bce7ef26d21c29b2a0635f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    **[Test build #76664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76664/testReport)** for PR 17918 at commit [`39b72ca`](https://github.com/apache/spark/commit/39b72caff37dc3da1e50beccfbe58f56cedf8007).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    **[Test build #76731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76731/testReport)** for PR 17918 at commit [`d4ab730`](https://github.com/apache/spark/commit/d4ab7307f4fc51ea91f5617b88b4aaa14a85cbab).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    **[Test build #76731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76731/testReport)** for PR 17918 at commit [`d4ab730`](https://github.com/apache/spark/commit/d4ab7307f4fc51ea91f5617b88b4aaa14a85cbab).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    @ron8hu Yes, but I forgot the query numbers. The case is, first we filter a large part of table A, then join table B on another column. Since ndv of that column is not scaled down, the estimated cardinality of the join is too small.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    thanks, merging to master/2.2!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    **[Test build #76683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76683/testReport)** for PR 17918 at commit [`2289422`](https://github.com/apache/spark/commit/2289422d20fe4d8012bce7ef26d21c29b2a0635f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17918: [SPARK-20678][SQL] Ndv for columns not in filter conditi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17918
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76731/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org