You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by ron8hu <gi...@git.apache.org> on 2017/11/19 19:59:33 UTC

[GitHub] spark pull request #19783: support histogram in filter cardinality estimatio...

GitHub user ron8hu opened a pull request:

    https://github.com/apache/spark/pull/19783

    support histogram in filter cardinality estimation

    ## What changes were proposed in this pull request?
    
    Histogram is effective in dealing with skewed distribution. After we generate histogram information for column statistics, we need to adjust filter estimation based on histogram data structure.
    
    ## How was this patch tested?
    
    We revised all the unit test cases by including histogram data structure.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ron8hu/spark supportHistogram

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19783.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19783
    
----
commit dd5b975dafdf9fc4edd94cf6e369f5e899db74e2
Author: Ron Hu <ro...@huawei.com>
Date:   2017-11-19T19:37:47Z

    support histogram in filter cardinality estimation

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155231547
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154254419
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       test("cbool > false") {
         validateEstimatedStats(
           Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
    -      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
    +      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
    --- End diff --
    
    My earlier comment mentioned this test case.  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r151890602
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -158,8 +196,8 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
         val condition = Not(And(LessThan(attrInt, Literal(3)), Literal(null, IntegerType)))
         validateEstimatedStats(
           Filter(condition, childStatsTestPlan(Seq(attrInt), 10L)),
    -      Seq(attrInt -> colStatInt.copy(distinctCount = 8)),
    -      expectedRowCount = 8)
    +      Seq(attrInt -> colStatInt.copy(distinctCount = 7)),
    --- End diff --
    
    Shall we add new test cases for filter estimation based on histogram, instead of modifying existing test results?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154848018
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    --- End diff --
    
    Why is this method so different from `findFirstBinForValue`? It looks like we just need to reverse the iteration order, i.e. from `bins.length` to `0`. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84003/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155125029
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    --- End diff --
    
    Good point.  Actually while loop is better because it can exit early when the condition no longer qualifies.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155692778
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       test("cbool > false") {
         validateEstimatedStats(
           Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
    -      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
    +      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
    --- End diff --
    
    That may need special code path for boolean type, but IMHO I don't think it deserves the complexity.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153977299
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    --- End diff --
    
    for robustness, add `assert(bins.head.lo <= value && bins.last.hi >= value)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155560015
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,144 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of bins for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupies its bin
    +      val lowerCurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
    +
    +      val higherCurBin = histogram.bins(higherId)
    +      val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin)
    +
    +      // the total length is lowerPart + higherPart + bins between them
    +      lowerPart + higherPart + higherId - lowerId - 1
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationNdv(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram)
    +    : Long = {
    +    val ndv: Double = if (higherEnd == lowerEnd) {
    +      1
    +    } else if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin) * curBin.ndv
    +    } else {
    +      // compute how much the [lowerEnd, higherEnd] range occupies the bins in a histogram.
    +      // Our computation has 3 parts: the smallest/min bin, the middle bins, the largest/max bin.
    +      val minCurBin = histogram.bins(lowerId)
    +      val minPartNdv = getOccupation(minCurBin.hi, lowerEnd, minCurBin) * minCurBin.ndv
    +
    +      val maxCurBin = histogram.bins(higherId)
    +      val maxPartNdv = getOccupation(higherEnd, maxCurBin.lo, maxCurBin) * maxCurBin.ndv
    +
    +      // The total ndv is minPartNdv + maxPartNdv + Ndvs between them.
    +      // In order to avoid counting same distinct value twice, we check if the upperBound value
    --- End diff --
    
    ah i see, it's possible for skewed data that some bins just represent one literal.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84732/testReport)** for PR 19783 at commit [`be1e7ba`](https://github.com/apache/spark/commit/be1e7ba1ae485da523232e2ddffce8ac911392dd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19783


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155560228
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,144 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of bins for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupies its bin
    +      val lowerCurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
    +
    +      val higherCurBin = histogram.bins(higherId)
    +      val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin)
    +
    +      // the total length is lowerPart + higherPart + bins between them
    +      lowerPart + higherPart + higherId - lowerId - 1
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationNdv(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram)
    +    : Long = {
    +    val ndv: Double = if (higherEnd == lowerEnd) {
    +      1
    +    } else if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin) * curBin.ndv
    +    } else {
    +      // compute how much the [lowerEnd, higherEnd] range occupies the bins in a histogram.
    +      // Our computation has 3 parts: the smallest/min bin, the middle bins, the largest/max bin.
    +      val minCurBin = histogram.bins(lowerId)
    +      val minPartNdv = getOccupation(minCurBin.hi, lowerEnd, minCurBin) * minCurBin.ndv
    +
    +      val maxCurBin = histogram.bins(higherId)
    +      val maxPartNdv = getOccupation(higherEnd, maxCurBin.lo, maxCurBin) * maxCurBin.ndv
    +
    +      // The total ndv is minPartNdv + maxPartNdv + Ndvs between them.
    +      // In order to avoid counting same distinct value twice, we check if the upperBound value
    +      // of next bin is equal to the hi value of the previous bin.  We bump up
    +      // ndv value only if the hi values of two consecutive bins are different.
    +      var middleNdv: Long = 0
    +      for (i <- histogram.bins.indices) {
    --- End diff --
    
    again this is a typical while loop pattern.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153978787
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    --- End diff --
    
    could you remove the empty line between method comment and its definition?
    same for other methods here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153977915
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    --- End diff --
    
    merge two `if`s: if ((value == histogram.bins(i).hi) && (value == histogram.bins(i + 1).lo) && (i < histogram.bins.length - 1))


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84365 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84365/testReport)** for PR 19783 at commit [`6e6c49b`](https://github.com/apache/spark/commit/6e6c49bdfef0071f18b6e9b607b7fc1bf5b49ff8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84517/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84236/testReport)** for PR 19783 at commit [`052d111`](https://github.com/apache/spark/commit/052d11159478da563bd9b514b4267b28ba3347f9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154848277
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    for (i <- bins.indices) {
    +      if (value > bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) {
    +        // We assume the above 3 conditions will be evaluated from left to right sequentially.
    +        // If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen.
    +        // At that time, we should split the third condition into another if statement.
    +        // increment binId since the value appears in this bin and next bin
    +        binId += 1
    +      }
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    --- End diff --
    
    instead of accepting a `binId` and `histogram`, can't we just ask the caller side to pass a `HistogramBin`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153902382
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin/bucket into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin/bucket into which a column values falls.
    +   */
    +
    +  def findFirstBucketForValue(value: Double, histogram: Histogram): Int = {
    --- End diff --
    
    We had bucket(s) and bin(s) used interchangeably.  To avoid confusion, I will unify them to use only bin/bins.  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153972336
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      // We compute filter selectivity using Histogram information
    +      attr.dataType match {
    +        case StringType | BinaryType =>
    +          Some(1.0 / BigDecimal(ndv))
    +
    +        case _ =>
    +          // returns 1/ndv if there is no histogram
    +          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
    +
    +          // We traverse histogram bins to locate the literal value
    +          val hgmBins = colStat.histogram.get.bins
    +          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +          // find the interval where this datum locates
    --- End diff --
    
    we can remove this comment, it's explained above


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84365/testReport)** for PR 19783 at commit [`6e6c49b`](https://github.com/apache/spark/commit/6e6c49bdfef0071f18b6e9b607b7fc1bf5b49ff8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84317/testReport)** for PR 19783 at commit [`241089c`](https://github.com/apache/spark/commit/241089cebadd74d189e80af6a0f4d87817a10faa).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156281223
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,99 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    --- End diff --
    
    redundant


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84317/testReport)** for PR 19783 at commit [`241089c`](https://github.com/apache/spark/commit/241089cebadd74d189e80af6a0f4d87817a10faa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155560861
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,144 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of bins for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupies its bin
    +      val lowerCurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
    +
    +      val higherCurBin = histogram.bins(higherId)
    +      val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin)
    +
    +      // the total length is lowerPart + higherPart + bins between them
    +      lowerPart + higherPart + higherId - lowerId - 1
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationNdv(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram)
    +    : Long = {
    +    val ndv: Double = if (higherEnd == lowerEnd) {
    +      1
    +    } else if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin) * curBin.ndv
    +    } else {
    +      // compute how much the [lowerEnd, higherEnd] range occupies the bins in a histogram.
    +      // Our computation has 3 parts: the smallest/min bin, the middle bins, the largest/max bin.
    +      val minCurBin = histogram.bins(lowerId)
    +      val minPartNdv = getOccupation(minCurBin.hi, lowerEnd, minCurBin) * minCurBin.ndv
    +
    +      val maxCurBin = histogram.bins(higherId)
    +      val maxPartNdv = getOccupation(higherEnd, maxCurBin.lo, maxCurBin) * maxCurBin.ndv
    +
    +      // The total ndv is minPartNdv + maxPartNdv + Ndvs between them.
    +      // In order to avoid counting same distinct value twice, we check if the upperBound value
    +      // of next bin is equal to the hi value of the previous bin.  We bump up
    +      // ndv value only if the hi values of two consecutive bins are different.
    --- End diff --
    
    this doesn't match the code, the actual logic is: `only if the lo and hi values of the bin are different`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153975698
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
       def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt)
         : AttributeMap[ColumnStat] = {
         val newColumnStats = originalMap.map { case (attr, oriColStat) =>
    -      // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows
    -      // decreases; otherwise keep it unchanged.
    -      val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter,
    -        newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount)
           val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat)
    +      val newNdv = if (colStat.distinctCount > 1) {
    --- End diff --
    
    no need to add extra check here, in `EstimationUtils.updateNdv` we already check this case. We can just revert the change here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84751/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153976879
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -529,6 +575,55 @@ case class FilterEstimation(plan: Filter) extends Logging {
         Some(percent)
       }
     
    +  /**
    +   * Returns the selectivity percentage for a combined op-dataNumber in the column's
    +   * current valid range [min, max]
    +   *
    +   * @param op a binary comparison operator
    +   * @param histogram a numeric equi-height histogram
    +   * @param max the upper bound of the current valid range for a given column
    +   * @param min the lower bound of the current valid range for a given column
    +   * @param datumNumber the numeric value of a literal
    +   * @return the selectivity percentage for a condition in the current range.
    +   */
    +
    +  def computePercentForNumericEquiHeightHgm(
    --- End diff --
    
    `computePercentByEquiHeightHgm` or just `computePercentByHistogram`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156281155
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,99 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153979438
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    +          // increment binId since the value appears into this bin and next bin
    +          binId += 1
    +        }
    +      }
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (binId == 0 && curBin.hi == curBin.lo) {
    +      // the Min of the histogram occupies the whole first bin
    +      1.0
    +    } else if (binId == 0 && curBin.hi != curBin.lo) {
    +      if (higherValue == lowerValue) {
    +        // in the case curBin.binNdv == 0, current bin is occupied by one value, which
    --- End diff --
    
    binNdv will never be zero


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155233875
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (curBin.hi == curBin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / curBin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    --- End diff --
    
    `Returns the number of bins...` why the return type is double?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153973342
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
           percent = 1.0
         } else {
    --- End diff --
    
    `else if (colStat.histogram.isEmpty)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155571394
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       test("cbool > false") {
         validateEstimatedStats(
           Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
    -      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
    +      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
    --- End diff --
    
    Logically this is wrong, although it's not a big deal to break this case. Is there any way we can fix it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r151979117
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -158,8 +196,8 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
         val condition = Not(And(LessThan(attrInt, Literal(3)), Literal(null, IntegerType)))
         validateEstimatedStats(
           Filter(condition, childStatsTestPlan(Seq(attrInt), 10L)),
    -      Seq(attrInt -> colStatInt.copy(distinctCount = 8)),
    -      expectedRowCount = 8)
    +      Seq(attrInt -> colStatInt.copy(distinctCount = 7)),
    --- End diff --
    
    +1


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153970902
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      // We compute filter selectivity using Histogram information
    --- End diff --
    
    move this comment where the histogram computation really starts


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154254145
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
       def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt)
         : AttributeMap[ColumnStat] = {
         val newColumnStats = originalMap.map { case (attr, oriColStat) =>
    -      // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows
    -      // decreases; otherwise keep it unchanged.
    -      val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter,
    -        newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount)
           val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat)
    +      val newNdv = if (colStat.distinctCount > 1) {
    --- End diff --
    
    The old code does not work well for a couple of new skewed-distribution tests.  For example, test("cintHgm < 3") would fail.  Because it still computes to find newNdv in updateNdv() method.  But, in reality, we already scale it down to 1.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156282033
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,44 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    +        val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +        val histogram = colStat.histogram.get
    +        val hgmBins = histogram.bins
    +
    +        // find bins where column's current min and max locate.  Note that a column's [min, max]
    +        // range may change due to another condition applied earlier.
    +        val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
    +        val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
    +        val minBinId = EstimationUtils.findFirstBinForValue(min, hgmBins)
    --- End diff --
    
    nit: `minBinIndex`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154252063
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging {
     
             op match {
               case _: GreaterThan | _: GreaterThanOrEqual =>
    -            // If new ndv is 1, then new max must be equal to new min.
    -            newMin = if (newNdv == 1) newMax else newValue
    +            newMin = newValue
               case _: LessThan | _: LessThanOrEqual =>
    -            newMax = if (newNdv == 1) newMin else newValue
    +            newMax = newValue
    --- End diff --
    
    Previously I coded that way because of a corner test case: test("cbool > false").  At that time, I set the newMin to newMax since newNdv = 1.  However, this logic does not work well for the skewed distribution test case: test ("cintHgm < 3").  In this test, newMin=1 newMax=3.  I think the revised code makes better sense.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155233351
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (curBin.hi == curBin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / curBin.ndv.toDouble
    --- End diff --
    
    shouldn't we check the `lowerValue/higherValues` fits in the bin value range?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155343962
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (curBin.hi == curBin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / curBin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    --- End diff --
    
    This is because we may return a percentage of a bin.  For example, a predicate column=5 may return the number of bins 0.2 if the holding bin has 5 distinct values.  Hence, we cannot return an integer type value.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155564302
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    +        // Here we traverse histogram bins to locate the range of bins the literal values falls
    +        // into.  For skewed distribution, a literal value can occupy multiple bins.
    +        val hgmBins = colStat.histogram.get.bins
    +        val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +        var lowerId, higherId = -1
    +        for (i <- hgmBins.indices) {
    +          // if datum > upperBound, just traverse to next bin
    +          if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
    +          if (higherId < 0) {
    +            if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) ||
    +              ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) {
    +              higherId = i
    +            }
    +           }
    --- End diff --
    
    how about 
    ```
    var lowerId = -1
    var highIdFound = false
    var i = 0
    while (i < hgmBins.length || highIdFound) {
      if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
      if (datum >= hgmBins(i).lo) highIdFound = true
    }
    val highId = i
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84190/testReport)** for PR 19783 at commit [`8e5d04e`](https://github.com/apache/spark/commit/8e5d04ef4687917b2487e5b9267bc40ba3ef33a1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153973312
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
           percent = 1.0
         } else {
           // This is the partial overlap case:
    -      // Without advanced statistics like histogram, we assume uniform data distribution.
    -      // We just prorate the adjusted range over the initial range to compute filter selectivity.
    -      assert(max > min)
    -      percent = op match {
    -        case _: LessThan =>
    -          if (numericLiteral == max) {
    -            // If the literal value is right on the boundary, we can minus the part of the
    -            // boundary value (1/ndv).
    -            1.0 - 1.0 / ndv
    -          } else {
    -            (numericLiteral - min) / (max - min)
    -          }
    -        case _: LessThanOrEqual =>
    -          if (numericLiteral == min) {
    -            // The boundary value is the only satisfying value.
    -            1.0 / ndv
    -          } else {
    -            (numericLiteral - min) / (max - min)
    -          }
    -        case _: GreaterThan =>
    -          if (numericLiteral == min) {
    -            1.0 - 1.0 / ndv
    -          } else {
    -            (max - numericLiteral) / (max - min)
    -          }
    -        case _: GreaterThanOrEqual =>
    -          if (numericLiteral == max) {
    -            1.0 / ndv
    -          } else {
    -            (max - numericLiteral) / (max - min)
    -          }
    +
    +      if (colStat.histogram.isEmpty) {
    --- End diff --
    
    move this `if` to upper level, then we can reduce code diff


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155558275
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,144 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of bins for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupies its bin
    +      val lowerCurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
    +
    +      val higherCurBin = histogram.bins(higherId)
    +      val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin)
    +
    +      // the total length is lowerPart + higherPart + bins between them
    +      lowerPart + higherPart + higherId - lowerId - 1
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationNdv(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram)
    +    : Long = {
    --- End diff --
    
    nit: move this to the previous line


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155231456
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156281132
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,99 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    --- End diff --
    
    Seems this is redundant, shall we remove it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153976662
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    --- End diff --
    
    i <- bins.indices


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153973474
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging {
     
             op match {
               case _: GreaterThan | _: GreaterThanOrEqual =>
    -            // If new ndv is 1, then new max must be equal to new min.
    -            newMin = if (newNdv == 1) newMax else newValue
    +            newMin = newValue
               case _: LessThan | _: LessThanOrEqual =>
    -            newMax = if (newNdv == 1) newMin else newValue
    +            newMax = newValue
    --- End diff --
    
    why change these two line?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84573/testReport)** for PR 19783 at commit [`c9538b8`](https://github.com/apache/spark/commit/c9538b8410f4210dea1e8fcda3f09539f95b1b38).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156281162
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,99 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153972517
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      // We compute filter selectivity using Histogram information
    +      attr.dataType match {
    +        case StringType | BinaryType =>
    +          Some(1.0 / BigDecimal(ndv))
    +
    +        case _ =>
    +          // returns 1/ndv if there is no histogram
    +          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
    +
    +          // We traverse histogram bins to locate the literal value
    +          val hgmBins = colStat.histogram.get.bins
    +          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +          // find the interval where this datum locates
    +          var lowerId, higherId = -1
    +          for (i <- hgmBins.indices) {
    +            // if datum > upperBound, just move to next bin
    --- End diff --
    
    please remove the comment, it does not match the logic at next line (there's no "move" logic)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84385/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84317/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153974257
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
           expectedRowCount = 5)
       }
     
    +  // The following test cases have histogram information collected for the test column
    +  test("Not(cintHgm < 3 AND null)") {
    +    val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType)))
    +    validateEstimatedStats(
    +      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)),
    +      expectedRowCount = 9)
    +  }
    +
    +  test("cintHgm = 5") {
    +    validateEstimatedStats(
    +      Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 4)
    +  }
    +
    +  test("cintHgm = 0") {
    +    // This is an out-of-range case since 0 is outside the range [min, max]
    +    validateEstimatedStats(
    +      Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm < 3") {
    +    validateEstimatedStats(
    +      Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm < 0") {
    +    // This is a corner case since literal 0 is smaller than min.
    +    validateEstimatedStats(
    +      Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm <= 3") {
    +    validateEstimatedStats(
    +      Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm > 6") {
    +    validateEstimatedStats(
    +      Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm > 10") {
    +    // This is a corner case since max value is 10.
    +    validateEstimatedStats(
    +      Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm >= 6") {
    +    validateEstimatedStats(
    +      Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 4)
    +  }
    +
    +  test("cintHgm IS NULL") {
    +    validateEstimatedStats(
    +      Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm IS NOT NULL") {
    +    validateEstimatedStats(
    +      Filter(IsNotNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 6, min = Some(1), max = Some(10),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 10)
    +  }
    +
    +  test("cintHgm > 3 AND cintHgm <= 6") {
    +    val condition = And(GreaterThan(attrIntHgm,
    +      Literal(3)), LessThanOrEqual(attrIntHgm, Literal(6)))
    +    validateEstimatedStats(
    +      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 5, min = Some(3), max = Some(6),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 8)
    +  }
    +
    +  test("cintHgm = 3 OR cintHgm = 6") {
    --- End diff --
    
    I think we don't need test cases for combination conditions like AND, OR, NOT, because histogram doesn't affect estimation logic for them. Instead, we need to test more cases for histogram, e.g. =, >=, >, <=, <, and for different distributions, e.g. skewed distribution and non-skewed distribution.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154223705
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    --- End diff --
    
    I hesitate to add an assert statement here.  This is because an assert such as this may cause Spark system to crash if a user does not fresh his data statistics quickly.  In real world, a user may load data, collect statistics, and then add more incremental data, but does not collect statistics immediately.  He may issue a SQL query against his newly added data such as "WHERE column=xxx", where xxx is a new value in his incremental load.  After all, statistics are auxiliary, a query should still run even the statistics are not up to date.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154847802
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    --- End diff --
    
    this looks more like a while loop pattern, can we use while loop here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153972295
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      // We compute filter selectivity using Histogram information
    +      attr.dataType match {
    +        case StringType | BinaryType =>
    +          Some(1.0 / BigDecimal(ndv))
    +
    +        case _ =>
    +          // returns 1/ndv if there is no histogram
    +          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
    +
    +          // We traverse histogram bins to locate the literal value
    --- End diff --
    
    This comment is not accurate, here we want to get the bins occupied by the literal value, because if the value is skewed, it can occupy multiple bins.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84365/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84385/testReport)** for PR 19783 at commit [`9d2a463`](https://github.com/apache/spark/commit/9d2a463f76ecd164dd15a834aa971a4d222113b6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84236/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84700/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155125157
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    --- End diff --
    
    Good point.  We can simplify the logic by iterating from bins.length-1 to 0.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155691788
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -529,6 +570,56 @@ case class FilterEstimation(plan: Filter) extends Logging {
         Some(percent)
       }
     
    +  /**
    +   * Returns the selectivity percentage for binary condition in the column's
    +   * current valid range [min, max]
    +   *
    +   * @param op a binary comparison operator
    +   * @param histogram a numeric equi-height histogram
    +   * @param max the upper bound of the current valid range for a given column
    +   * @param min the lower bound of the current valid range for a given column
    +   * @param datumNumber the numeric value of a literal
    +   * @return the selectivity percentage for a condition in the current range.
    +   */
    +
    +  def computePercentByEquiHeightHgm(
    +      op: BinaryComparison,
    +      histogram: Histogram,
    +      max: Double,
    +      min: Double,
    +      datumNumber: Double): Double = {
    +    // find bins where column's current min and max locate.  Note that a column's [min, max]
    +    // range may change due to another condition applied earlier.
    +    val minBinId = EstimationUtils.findFirstBinForValue(min, histogram.bins)
    +    val maxBinId = EstimationUtils.findLastBinForValue(max, histogram.bins)
    +    assert(minBinId <= maxBinId)
    +
    +    // compute how many bins the column's current valid range [min, max] occupies.
    +    // Note that a column's [min, max] range may vary after we apply some filter conditions.
    +    val minToMaxLength = EstimationUtils.getOccupationBins(maxBinId, minBinId, max,
    --- End diff --
    
    Personally I prefer to have this method unit-tested, because it's the core part of filter estimation. We can do this in follow-up anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155559543
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    --- End diff --
    
    better to move these to a new method.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153686107
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin/bucket into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin/bucket into which a column values falls.
    +   */
    +
    +  def findFirstBucketForValue(value: Double, histogram: Histogram): Int = {
    --- End diff --
    
    Shall we unify all names to `bin`/`bins` in code and comments?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153973544
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       test("cbool > false") {
         validateEstimatedStats(
           Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
    -      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
    +      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
    --- End diff --
    
    why the result is changed?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153974940
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
           percent = 1.0
         } else {
           // This is the partial overlap case:
    -      // Without advanced statistics like histogram, we assume uniform data distribution.
    -      // We just prorate the adjusted range over the initial range to compute filter selectivity.
    -      assert(max > min)
    -      percent = op match {
    -        case _: LessThan =>
    -          if (numericLiteral == max) {
    -            // If the literal value is right on the boundary, we can minus the part of the
    -            // boundary value (1/ndv).
    -            1.0 - 1.0 / ndv
    -          } else {
    -            (numericLiteral - min) / (max - min)
    -          }
    -        case _: LessThanOrEqual =>
    -          if (numericLiteral == min) {
    -            // The boundary value is the only satisfying value.
    -            1.0 / ndv
    -          } else {
    -            (numericLiteral - min) / (max - min)
    -          }
    -        case _: GreaterThan =>
    -          if (numericLiteral == min) {
    -            1.0 - 1.0 / ndv
    -          } else {
    -            (max - numericLiteral) / (max - min)
    -          }
    -        case _: GreaterThanOrEqual =>
    -          if (numericLiteral == max) {
    -            1.0 / ndv
    -          } else {
    -            (max - numericLiteral) / (max - min)
    -          }
    +
    +      if (colStat.histogram.isEmpty) {
    +        // Without advanced statistics like histogram, we assume uniform data distribution.
    +        // We just prorate the adjusted range over the initial range to compute filter selectivity.
    +        assert(max > min)
    +        percent = op match {
    +          case _: LessThan =>
    +            if (numericLiteral == max) {
    +              // If the literal value is right on the boundary, we can minus the part of the
    +              // boundary value (1/ndv).
    +              1.0 - 1.0 / ndv
    +            } else {
    +              (numericLiteral - min) / (max - min)
    +            }
    +          case _: LessThanOrEqual =>
    +            if (numericLiteral == min) {
    +              // The boundary value is the only satisfying value.
    +              1.0 / ndv
    +            } else {
    +              (numericLiteral - min) / (max - min)
    +            }
    +          case _: GreaterThan =>
    +            if (numericLiteral == min) {
    +              1.0 - 1.0 / ndv
    +            } else {
    +              (max - numericLiteral) / (max - min)
    +            }
    +          case _: GreaterThanOrEqual =>
    +            if (numericLiteral == max) {
    +              1.0 / ndv
    +            } else {
    +              (max - numericLiteral) / (max - min)
    +            }
    +        }
    +      } else {
    +        val numericHistogram = colStat.histogram.get
    +        val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +        val maxDouble = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
    +        val minDouble = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
    --- End diff --
    
    `max, min` is good enough I think


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156282461
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,44 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    +        val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +        val histogram = colStat.histogram.get
    +        val hgmBins = histogram.bins
    +
    +        // find bins where column's current min and max locate.  Note that a column's [min, max]
    +        // range may change due to another condition applied earlier.
    +        val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
    +        val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
    +        val minBinId = EstimationUtils.findFirstBinForValue(min, hgmBins)
    +        val maxBinId = EstimationUtils.findLastBinForValue(max, hgmBins)
    +
    +        // compute how many bins the column's current valid range [min, max] occupies.
    +        // Note that a column's [min, max] range may vary after we apply some filter conditions.
    +        val validRangeBins = EstimationUtils.getOccupationBins(maxBinId, minBinId, max,
    +          min, histogram)
    +
    +        val lowerBinId = EstimationUtils.findFirstBinForValue(datum, hgmBins)
    +        val higherBinId = EstimationUtils.findLastBinForValue(datum, hgmBins)
    +        assert(lowerBinId <= higherBinId)
    +        val lowerBinNdv = hgmBins(lowerBinId).ndv
    +        val higherBinNdv = hgmBins(higherBinId).ndv
    +        // assume uniform distribution in each bin
    +        val occupiedBins = if (lowerBinId == higherBinId) {
    --- End diff --
    
    is this just `EstimationUtils.getOccupationBins(higherBinId, lowerBinId, datum, datum, histogram)`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155963740
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
           percent = 1.0
         } else {
           // This is the partial overlap case:
    -      // Without advanced statistics like histogram, we assume uniform data distribution.
    -      // We just prorate the adjusted range over the initial range to compute filter selectivity.
    -      assert(max > min)
    -      percent = op match {
    -        case _: LessThan =>
    -          if (numericLiteral == max) {
    -            // If the literal value is right on the boundary, we can minus the part of the
    -            // boundary value (1/ndv).
    -            1.0 - 1.0 / ndv
    -          } else {
    -            (numericLiteral - min) / (max - min)
    -          }
    -        case _: LessThanOrEqual =>
    -          if (numericLiteral == min) {
    -            // The boundary value is the only satisfying value.
    -            1.0 / ndv
    -          } else {
    -            (numericLiteral - min) / (max - min)
    -          }
    -        case _: GreaterThan =>
    -          if (numericLiteral == min) {
    -            1.0 - 1.0 / ndv
    -          } else {
    -            (max - numericLiteral) / (max - min)
    -          }
    -        case _: GreaterThanOrEqual =>
    -          if (numericLiteral == max) {
    -            1.0 / ndv
    -          } else {
    -            (max - numericLiteral) / (max - min)
    -          }
    +
    +      if (colStat.histogram.isEmpty) {
    --- End diff --
    
    We cannot move the if statement to upper level.  This is because, for the partial overlap case, we need to update the the [min, max] range for a given column.  For the no-overlap and complete-overlap cases, we do  not need to do so.  I think the current code is modular for this reason.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153979575
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    +          // increment binId since the value appears into this bin and next bin
    +          binId += 1
    +        }
    +      }
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (binId == 0 && curBin.hi == curBin.lo) {
    +      // the Min of the histogram occupies the whole first bin
    +      1.0
    +    } else if (binId == 0 && curBin.hi != curBin.lo) {
    +      if (higherValue == lowerValue) {
    +        // in the case curBin.binNdv == 0, current bin is occupied by one value, which
    +        // is included in the previous bin
    +        1.0 / math.max(curBin.ndv.toDouble, 1)
    +      } else {
    +        (higherValue - lowerValue) / (curBin.hi - curBin.lo)
    +      }
    +    } else {
    +      if (curBin.hi == curBin.lo) {
    +        // the entire bin is covered in the range
    +        1.0
    +      } else if (higherValue == lowerValue) {
    +        // the literal value falls in this bin
    +        1.0 / math.max(curBin.ndv.toDouble, 1)
    +      } else {
    +        // Use proration since the range falls inside this bin.
    +        math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the selectivity percentage for column values in [lowerValue, higherValue].
    +   */
    +
    +  def getOccupationBins(
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    // find bins where current min and max locate
    +    val minBinId = findFirstBinForValue(lowerEnd, histogram)
    +    val maxBinId = findLastBinForValue(higherEnd, histogram)
    +    assert(minBinId <= maxBinId)
    +
    +    // compute how much current [min, max] occupy the histogram, in the number of bins
    +    getOccupationBins(maxBinId, minBinId, higherEnd, lowerEnd, histogram)
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the selectivity percentage for column values in [lowerEnd, higherEnd].
    +   */
    +
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    if (lowerId == higherId) {
    +      getOccupation(lowerId, higherEnd, lowerEnd, histogram)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupy its bin
    +      val lowercurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerId, lowercurBin.hi, lowerEnd, histogram)
    +
    +      // in case higherId > lowerId, higherId must be > 0
    +      val highercurBin = histogram.bins(higherId)
    +      val higherPart = getOccupation(higherId, higherEnd, highercurBin.lo,
    +        histogram)
    +      // the total length is lowerPart + higherPart + bins between them
    +      higherId - lowerId - 1 + lowerPart + higherPart
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   */
    +
    +  def getOccupationNdv(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram)
    +    : Long = {
    +    val ndv: Double = if (higherEnd == lowerEnd) {
    +      1
    +    } else if (lowerId == higherId) {
    +      getOccupation(lowerId, higherEnd, lowerEnd, histogram) * histogram.bins(lowerId).ndv
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupy its bin
    --- End diff --
    
    typo: occupies its bin


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154249995
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    --- End diff --
    
    No.  I meant the upper bound for the array of bins in a histogram.  The default length of the histogram bin array is 254.  When i is equal to 253 (the last bin), then i+1 is 254 leading to out-of-bound error.  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153977791
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    --- End diff --
    
    add assert here too


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84700/testReport)** for PR 19783 at commit [`be1e7ba`](https://github.com/apache/spark/commit/be1e7ba1ae485da523232e2ddffce8ac911392dd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155683828
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,144 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of bins for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupies its bin
    +      val lowerCurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
    +
    +      val higherCurBin = histogram.bins(higherId)
    +      val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin)
    +
    +      // the total length is lowerPart + higherPart + bins between them
    +      lowerPart + higherPart + higherId - lowerId - 1
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationNdv(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram)
    +    : Long = {
    +    val ndv: Double = if (higherEnd == lowerEnd) {
    +      1
    +    } else if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin) * curBin.ndv
    +    } else {
    +      // compute how much the [lowerEnd, higherEnd] range occupies the bins in a histogram.
    +      // Our computation has 3 parts: the smallest/min bin, the middle bins, the largest/max bin.
    +      val minCurBin = histogram.bins(lowerId)
    +      val minPartNdv = getOccupation(minCurBin.hi, lowerEnd, minCurBin) * minCurBin.ndv
    +
    +      val maxCurBin = histogram.bins(higherId)
    +      val maxPartNdv = getOccupation(higherEnd, maxCurBin.lo, maxCurBin) * maxCurBin.ndv
    +
    +      // The total ndv is minPartNdv + maxPartNdv + Ndvs between them.
    +      // In order to avoid counting same distinct value twice, we check if the upperBound value
    +      // of next bin is equal to the hi value of the previous bin.  We bump up
    +      // ndv value only if the hi values of two consecutive bins are different.
    --- End diff --
    
    I will change the comment so that it matches with the code.  Actually my original comment means the same thing as your comment.  This is because the hi value of a bin is equal to the lo value of the next bin.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156281044
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,99 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    --- End diff --
    
    nit: `number` -> `index`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154847703
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    --- End diff --
    
    shall we use binary search here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84517/testReport)** for PR 19783 at commit [`d068888`](https://github.com/apache/spark/commit/d068888d172897dde096c40ce70ad2584bbefc03).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155963622
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    --- End diff --
    
    O.  will do.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84751/testReport)** for PR 19783 at commit [`be1e7ba`](https://github.com/apache/spark/commit/be1e7ba1ae485da523232e2ddffce8ac911392dd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154250197
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      // We compute filter selectivity using Histogram information
    +      attr.dataType match {
    +        case StringType | BinaryType =>
    +          Some(1.0 / BigDecimal(ndv))
    +
    +        case _ =>
    +          // returns 1/ndv if there is no histogram
    +          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
    +
    +          // We traverse histogram bins to locate the literal value
    +          val hgmBins = colStat.histogram.get.bins
    +          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +          // find the interval where this datum locates
    +          var lowerId, higherId = -1
    +          for (i <- hgmBins.indices) {
    +            // if datum > upperBound, just move to next bin
    +            if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
    +            if (higherId < 0) {
    +              if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) ||
    +                ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) {
    +                higherId = i
    +              }
    +            }
    +          }
    +          assert(lowerId <= higherId)
    +          val lowerBinNdv = hgmBins(lowerId).ndv
    +          val higherBinNdv = hgmBins(higherId).ndv
    +          // assume uniform distribution in each bin
    +          val percent = if (lowerId == higherId) {
    +            (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1)
    +          } else {
    +            1.0 / hgmBins.length * (higherId - lowerId - 1) +
    +              (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) +
    +              (1.0 / hgmBins.length) / math.max(higherBinNdv, 1)
    +          }
    +          Some(percent)
    --- End diff --
    
    Good point. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84517/testReport)** for PR 19783 at commit [`d068888`](https://github.com/apache/spark/commit/d068888d172897dde096c40ce70ad2584bbefc03).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153970815
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      // We compute filter selectivity using Histogram information
    +      attr.dataType match {
    --- End diff --
    
    use ` if (colStat.histogram.isEmpty)` to seperate the logic of basic stats (`Some(1.0 / BigDecimal(ndv))`) and histogram computation.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156281583
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,99 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * The column value distribution is saved in an equi-height histogram.  The return values is a
    +   * double value is because we may return a portion of a bin. For example, a predicate
    +   * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    --- End diff --
    
    nit: `higherIndex`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84236/testReport)** for PR 19783 at commit [`052d111`](https://github.com/apache/spark/commit/052d11159478da563bd9b514b4267b28ba3347f9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155561210
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    +        // Here we traverse histogram bins to locate the range of bins the literal values falls
    +        // into.  For skewed distribution, a literal value can occupy multiple bins.
    +        val hgmBins = colStat.histogram.get.bins
    +        val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    --- End diff --
    
    cc @wzhfy , you would refactor this part to always use Double for CBO computing?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154223769
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    --- End diff --
    
    same comment as in my last reply.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155231792
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    --- End diff --
    
    `a percentage of a bin...` do you mean `percentage of bins`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84573 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84573/testReport)** for PR 19783 at commit [`c9538b8`](https://github.com/apache/spark/commit/c9538b8410f4210dea1e8fcda3f09539f95b1b38).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84190/testReport)** for PR 19783 at commit [`8e5d04e`](https://github.com/apache/spark/commit/8e5d04ef4687917b2487e5b9267bc40ba3ef33a1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153973781
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
           expectedRowCount = 5)
       }
     
    +  // The following test cases have histogram information collected for the test column
    +  test("Not(cintHgm < 3 AND null)") {
    +    val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType)))
    +    validateEstimatedStats(
    +      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)),
    +      expectedRowCount = 9)
    +  }
    +
    +  test("cintHgm = 5") {
    +    validateEstimatedStats(
    +      Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 4)
    +  }
    +
    +  test("cintHgm = 0") {
    +    // This is an out-of-range case since 0 is outside the range [min, max]
    +    validateEstimatedStats(
    +      Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm < 3") {
    +    validateEstimatedStats(
    +      Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm < 0") {
    +    // This is a corner case since literal 0 is smaller than min.
    +    validateEstimatedStats(
    +      Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm <= 3") {
    +    validateEstimatedStats(
    +      Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm > 6") {
    +    validateEstimatedStats(
    +      Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm > 10") {
    +    // This is a corner case since max value is 10.
    +    validateEstimatedStats(
    +      Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm >= 6") {
    +    validateEstimatedStats(
    +      Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 4)
    +  }
    +
    +  test("cintHgm IS NULL") {
    +    validateEstimatedStats(
    +      Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm IS NOT NULL") {
    --- End diff --
    
    histogram does not affect estimation logic for `IS NULL` and `IS NOT NULL` filter conditions, we can remove these two test cases.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154255068
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
           expectedRowCount = 5)
       }
     
    +  // The following test cases have histogram information collected for the test column
    +  test("Not(cintHgm < 3 AND null)") {
    +    val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType)))
    +    validateEstimatedStats(
    +      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)),
    +      expectedRowCount = 9)
    +  }
    +
    +  test("cintHgm = 5") {
    +    validateEstimatedStats(
    +      Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 4)
    +  }
    +
    +  test("cintHgm = 0") {
    +    // This is an out-of-range case since 0 is outside the range [min, max]
    +    validateEstimatedStats(
    +      Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm < 3") {
    +    validateEstimatedStats(
    +      Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm < 0") {
    +    // This is a corner case since literal 0 is smaller than min.
    +    validateEstimatedStats(
    +      Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm <= 3") {
    +    validateEstimatedStats(
    +      Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm > 6") {
    +    validateEstimatedStats(
    +      Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 2)
    +  }
    +
    +  test("cintHgm > 10") {
    +    // This is a corner case since max value is 10.
    +    validateEstimatedStats(
    +      Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm >= 6") {
    +    validateEstimatedStats(
    +      Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 4)
    +  }
    +
    +  test("cintHgm IS NULL") {
    +    validateEstimatedStats(
    +      Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Nil,
    +      expectedRowCount = 0)
    +  }
    +
    +  test("cintHgm IS NOT NULL") {
    +    validateEstimatedStats(
    +      Filter(IsNotNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 6, min = Some(1), max = Some(10),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 10)
    +  }
    +
    +  test("cintHgm > 3 AND cintHgm <= 6") {
    +    val condition = And(GreaterThan(attrIntHgm,
    +      Literal(3)), LessThanOrEqual(attrIntHgm, Literal(6)))
    +    validateEstimatedStats(
    +      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
    +      Seq(attrIntHgm -> ColumnStat(distinctCount = 5, min = Some(3), max = Some(6),
    +        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
    +      expectedRowCount = 8)
    +  }
    +
    +  test("cintHgm = 3 OR cintHgm = 6") {
    --- End diff --
    
    We have added histogram test cases for skewed distribution.  I will add more histogram test cases for non-skewed distribution.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154270654
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    --- End diff --
    
    just move this condition after the length check: 
    ```
    if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1) && (value == histogram.bins(i + 1).lo))
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155232397
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    --- End diff --
    
    the method signature looks weird, shouldn't it be
    ```
    private def getOccupation(
      higherValue: Double,
      lowerValue: Double,
      bin: HistogramBin)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155233734
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (curBin.hi == curBin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / curBin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
    --- End diff --
    
    shouldn't check the overlapping percentage? Do I miss some assumption for this method?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84700/testReport)** for PR 19783 at commit [`be1e7ba`](https://github.com/apache/spark/commit/be1e7ba1ae485da523232e2ddffce8ac911392dd).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154248775
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    --- End diff --
    
    By "out of bound", do you mean it exceeds 100 length limit? You can just switch new line after `&&`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84725/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84751/testReport)** for PR 19783 at commit [`be1e7ba`](https://github.com/apache/spark/commit/be1e7ba1ae485da523232e2ddffce8ac911392dd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153977529
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    +          // increment binId since the value appears into this bin and next bin
    --- End diff --
    
    appears in both this bin and the next bin


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155233099
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (curBin.hi == curBin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    --- End diff --
    
    I don't get it, shouldn't we check `lowerValue <= curBin.lo <= higherValue` here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155690722
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    +        // Here we traverse histogram bins to locate the range of bins the literal values falls
    +        // into.  For skewed distribution, a literal value can occupy multiple bins.
    +        val hgmBins = colStat.histogram.get.bins
    +        val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    --- End diff --
    
    yes, I'll refactor this part.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84732/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153972847
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      // We compute filter selectivity using Histogram information
    +      attr.dataType match {
    +        case StringType | BinaryType =>
    +          Some(1.0 / BigDecimal(ndv))
    +
    +        case _ =>
    +          // returns 1/ndv if there is no histogram
    +          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
    +
    +          // We traverse histogram bins to locate the literal value
    +          val hgmBins = colStat.histogram.get.bins
    +          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +          // find the interval where this datum locates
    +          var lowerId, higherId = -1
    +          for (i <- hgmBins.indices) {
    +            // if datum > upperBound, just move to next bin
    +            if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
    +            if (higherId < 0) {
    +              if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) ||
    +                ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) {
    +                higherId = i
    +              }
    +            }
    +          }
    +          assert(lowerId <= higherId)
    +          val lowerBinNdv = hgmBins(lowerId).ndv
    +          val higherBinNdv = hgmBins(higherId).ndv
    +          // assume uniform distribution in each bin
    +          val percent = if (lowerId == higherId) {
    +            (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1)
    +          } else {
    +            1.0 / hgmBins.length * (higherId - lowerId - 1) +
    +              (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) +
    +              (1.0 / hgmBins.length) / math.max(higherBinNdv, 1)
    +          }
    +          Some(percent)
    --- End diff --
    
    How about simplifying the above logic as:
    ```
    val occupiedBins = if (lowerId == higherId) {
      1.0 / lowerBinNdv
    } else {
      (higherId - lowerId - 1) + 1.0 / lowerBinNdv + 1.0 / higherBinNdv
    }
    Some(occupiedBins / hgmBins.length)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155231264
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    --- End diff --
    
    nit: `the id of the first bin into which the given value falls.`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156281918
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    --- End diff --
    
    did you create a new method?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155231523
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153979157
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    +          // increment binId since the value appears into this bin and next bin
    +          binId += 1
    +        }
    +      }
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (binId == 0 && curBin.hi == curBin.lo) {
    +      // the Min of the histogram occupies the whole first bin
    +      1.0
    +    } else if (binId == 0 && curBin.hi != curBin.lo) {
    +      if (higherValue == lowerValue) {
    +        // in the case curBin.binNdv == 0, current bin is occupied by one value, which
    +        // is included in the previous bin
    +        1.0 / math.max(curBin.ndv.toDouble, 1)
    +      } else {
    +        (higherValue - lowerValue) / (curBin.hi - curBin.lo)
    +      }
    +    } else {
    +      if (curBin.hi == curBin.lo) {
    +        // the entire bin is covered in the range
    +        1.0
    +      } else if (higherValue == lowerValue) {
    +        // the literal value falls in this bin
    +        1.0 / math.max(curBin.ndv.toDouble, 1)
    +      } else {
    +        // Use proration since the range falls inside this bin.
    +        math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the selectivity percentage for column values in [lowerValue, higherValue].
    +   */
    +
    +  def getOccupationBins(
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    // find bins where current min and max locate
    +    val minBinId = findFirstBinForValue(lowerEnd, histogram)
    +    val maxBinId = findLastBinForValue(higherEnd, histogram)
    --- End diff --
    
    how about `lowerBinId, higherBinId`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84190/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r156281819
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,99 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * The column value distribution is saved in an equi-height histogram.  The return values is a
    +   * double value is because we may return a portion of a bin. For example, a predicate
    +   * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of bins for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    assert(lowerId <= higherId)
    +
    +    if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupies its bin
    +      val lowerCurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
    --- End diff --
    
    shall we assert that `lowerBin.lo <= lowerEnd`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155560523
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,144 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of bins for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupies its bin
    +      val lowerCurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
    +
    +      val higherCurBin = histogram.bins(higherId)
    +      val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin)
    +
    +      // the total length is lowerPart + higherPart + bins between them
    +      lowerPart + higherPart + higherId - lowerId - 1
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationNdv(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram)
    +    : Long = {
    +    val ndv: Double = if (higherEnd == lowerEnd) {
    +      1
    +    } else if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin) * curBin.ndv
    +    } else {
    +      // compute how much the [lowerEnd, higherEnd] range occupies the bins in a histogram.
    +      // Our computation has 3 parts: the smallest/min bin, the middle bins, the largest/max bin.
    +      val minCurBin = histogram.bins(lowerId)
    +      val minPartNdv = getOccupation(minCurBin.hi, lowerEnd, minCurBin) * minCurBin.ndv
    +
    +      val maxCurBin = histogram.bins(higherId)
    +      val maxPartNdv = getOccupation(higherEnd, maxCurBin.lo, maxCurBin) * maxCurBin.ndv
    +
    +      // The total ndv is minPartNdv + maxPartNdv + Ndvs between them.
    +      // In order to avoid counting same distinct value twice, we check if the upperBound value
    +      // of next bin is equal to the hi value of the previous bin.  We bump up
    +      // ndv value only if the hi values of two consecutive bins are different.
    +      var middleNdv: Long = 0
    +      for (i <- histogram.bins.indices) {
    +        val bin = histogram.bins(i)
    +        if (bin.hi != bin.lo && i >= lowerId + 1 && i <= higherId - 1) {
    --- End diff --
    
    ```
    var i = lowerId + 1
    while (i < higherId) {
      ...
      i += 1
    }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155562073
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      if (colStat.histogram.isEmpty) {
    +        // returns 1/ndv if there is no histogram
    +        Some(1.0 / BigDecimal(ndv))
    +      } else {
    +        // We compute filter selectivity using Histogram information.
    +        // Here we traverse histogram bins to locate the range of bins the literal values falls
    +        // into.  For skewed distribution, a literal value can occupy multiple bins.
    +        val hgmBins = colStat.histogram.get.bins
    +        val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +        var lowerId, higherId = -1
    +        for (i <- hgmBins.indices) {
    --- End diff --
    
    ditto, while loop here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84003/testReport)** for PR 19783 at commit [`dd5b975`](https://github.com/apache/spark/commit/dd5b975dafdf9fc4edd94cf6e369f5e899db74e2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84385/testReport)** for PR 19783 at commit [`9d2a463`](https://github.com/apache/spark/commit/9d2a463f76ecd164dd15a834aa971a4d222113b6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84003/testReport)** for PR 19783 at commit [`dd5b975`](https://github.com/apache/spark/commit/dd5b975dafdf9fc4edd94cf6e369f5e899db74e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84725/testReport)** for PR 19783 at commit [`be1e7ba`](https://github.com/apache/spark/commit/be1e7ba1ae485da523232e2ddffce8ac911392dd).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    LGTM overall


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154850738
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    for (i <- bins.indices) {
    +      if (value > bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) {
    +        // We assume the above 3 conditions will be evaluated from left to right sequentially.
    +        // If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen.
    +        // At that time, we should split the third condition into another if statement.
    +        // increment binId since the value appears in this bin and next bin
    +        binId += 1
    +      }
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (binId == 0 && curBin.hi == curBin.lo) {
    +      // the Min of the histogram occupies the whole first bin
    +      1.0
    +    } else if (binId == 0 && curBin.hi != curBin.lo) {
    +      if (higherValue == lowerValue) {
    +        // set percentage to 1/NDV
    +        1.0 / curBin.ndv.toDouble
    +      } else {
    +        // Use proration since the range falls inside this bin.
    +        (higherValue - lowerValue) / (curBin.hi - curBin.lo)
    +      }
    +    } else {
    +      if (curBin.hi == curBin.lo) {
    +        // the entire bin is covered in the range
    +        1.0
    +      } else if (higherValue == lowerValue) {
    +        // set percentage to 1/NDV
    +        1.0 / curBin.ndv.toDouble
    +      } else {
    +        // Use proration since the range falls inside this bin.
    +        math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the selectivity percentage for column values in [lowerValue, higherValue].
    +   */
    +  def getOccupationBins(
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    // find bins where current min and max locate
    +    val lowerBinId = findFirstBinForValue(lowerEnd, histogram.bins)
    +    val higherBinId = findLastBinForValue(higherEnd, histogram.bins)
    +    assert(lowerBinId <= higherBinId)
    +
    +    // compute how much current [lowerEnd, higherEnd] range occupies the histogram in the
    +    // number of bins
    +    getOccupationBins(higherBinId, lowerBinId, higherEnd, lowerEnd, histogram)
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the selectivity percentage for column values in [lowerEnd, higherEnd].
    --- End diff --
    
    this doesn't match the java doc: `Returns the number of bins...`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155963930
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
    @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       test("cbool > false") {
         validateEstimatedStats(
           Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
    -      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
    +      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
    --- End diff --
    
    Agreed with wzhfy.  Today's logic is: for these 2 conditions, (column > x) and (column >= x), we set the min value to x.  We do not distinguish these 2 cases.  This is because we do not know the exact next value larger than x if x is a continuous data type like double type.  We may do some special coding for discrete data types such as Boolean or integer.  But, as wzhfy said, it does not deserve the complexity.   


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154848509
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    for (i <- bins.indices) {
    +      if (value > bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) {
    +        // We assume the above 3 conditions will be evaluated from left to right sequentially.
    +        // If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen.
    +        // At that time, we should split the third condition into another if statement.
    +        // increment binId since the value appears in this bin and next bin
    +        binId += 1
    +      }
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (binId == 0 && curBin.hi == curBin.lo) {
    +      // the Min of the histogram occupies the whole first bin
    +      1.0
    +    } else if (binId == 0 && curBin.hi != curBin.lo) {
    +      if (higherValue == lowerValue) {
    +        // set percentage to 1/NDV
    +        1.0 / curBin.ndv.toDouble
    +      } else {
    +        // Use proration since the range falls inside this bin.
    +        (higherValue - lowerValue) / (curBin.hi - curBin.lo)
    --- End diff --
    
    why do we need to specialize it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155964396
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (curBin.hi == curBin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / curBin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    --- End diff --
    
    OK.  Done.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155566119
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
           percent = 1.0
         } else {
           // This is the partial overlap case:
    -      // Without advanced statistics like histogram, we assume uniform data distribution.
    -      // We just prorate the adjusted range over the initial range to compute filter selectivity.
    -      assert(max > min)
    -      percent = op match {
    -        case _: LessThan =>
    -          if (numericLiteral == max) {
    -            // If the literal value is right on the boundary, we can minus the part of the
    -            // boundary value (1/ndv).
    -            1.0 - 1.0 / ndv
    -          } else {
    -            (numericLiteral - min) / (max - min)
    -          }
    -        case _: LessThanOrEqual =>
    -          if (numericLiteral == min) {
    -            // The boundary value is the only satisfying value.
    -            1.0 / ndv
    -          } else {
    -            (numericLiteral - min) / (max - min)
    -          }
    -        case _: GreaterThan =>
    -          if (numericLiteral == min) {
    -            1.0 - 1.0 / ndv
    -          } else {
    -            (max - numericLiteral) / (max - min)
    -          }
    -        case _: GreaterThanOrEqual =>
    -          if (numericLiteral == max) {
    -            1.0 / ndv
    -          } else {
    -            (max - numericLiteral) / (max - min)
    -          }
    +
    +      if (colStat.histogram.isEmpty) {
    --- End diff --
    
    yea please do it


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154225069
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    histogram.bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +
    +  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    +    for (i <- 0 until histogram.bins.length) {
    +      if (value > histogram.bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
    +        if (value == histogram.bins(i + 1).lo) {
    --- End diff --
    
    I used two statements instead of one statement is because, when i points to the last bin, this condition "value == histogram.bins(i + 1).lo" may be out of bound.  By separating the conditions into two statements, we can be sure that the out-of-bound error will not happen.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153976643
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    --- End diff --
    
    we can just pass bin array as parameter, this can simplify the code.
    same for other methods.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155231340
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    --- End diff --
    
    nit: s space after `+=`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84725/testReport)** for PR 19783 at commit [`be1e7ba`](https://github.com/apache/spark/commit/be1e7ba1ae485da523232e2ddffce8ac911392dd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    For the past 2 test builds #84725 and #84732, I checked the test result on the web.  Actually there were no failures.  See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84725/testReport/.  It appears that there is a bug in the jenkins test system.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155557934
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,144 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    --- End diff --
    
    do we have an assumption between `higherEnd`/`lowerEnd` and `higherId`/`lowerId`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    **[Test build #84732 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84732/testReport)** for PR 19783 at commit [`be1e7ba`](https://github.com/apache/spark/commit/be1e7ba1ae485da523232e2ddffce8ac911392dd).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153975383
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -529,6 +575,55 @@ case class FilterEstimation(plan: Filter) extends Logging {
         Some(percent)
       }
     
    +  /**
    +   * Returns the selectivity percentage for a combined op-dataNumber in the column's
    +   * current valid range [min, max]
    --- End diff --
    
    Returns the selectivity percentage for a binary condition given the column's current value range.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155558939
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,144 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the first bin into which a column value falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i += 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column value falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the id of the last bin into which a column value falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -= 1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param bin a single histogram bin
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      higherValue: Double,
    +      lowerValue: Double,
    +      bin: HistogramBin): Double = {
    +    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    if (bin.hi == bin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / bin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    +   * This is an overloaded method. The column value distribution is saved in an
    +   * equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of bins for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationBins(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram): Double = {
    +    if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin)
    +    } else {
    +      // compute how much lowerEnd/higherEnd occupies its bin
    +      val lowerCurBin = histogram.bins(lowerId)
    +      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
    +
    +      val higherCurBin = histogram.bins(higherId)
    +      val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin)
    +
    +      // the total length is lowerPart + higherPart + bins between them
    +      lowerPart + higherPart + higherId - lowerId - 1
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   * The column value distribution is saved in an equi-height histogram.
    +   *
    +   * @param higherId id of the high end bin holding the high end value of a column range
    +   * @param lowerId id of the low end bin holding the low end value of a column range
    +   * @param higherEnd a given upper bound value of a specified column value range
    +   * @param lowerEnd a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
    +   */
    +  def getOccupationNdv(
    +      higherId: Int,
    +      lowerId: Int,
    +      higherEnd: Double,
    +      lowerEnd: Double,
    +      histogram: Histogram)
    +    : Long = {
    +    val ndv: Double = if (higherEnd == lowerEnd) {
    +      1
    +    } else if (lowerId == higherId) {
    +      val curBin = histogram.bins(lowerId)
    +      getOccupation(higherEnd, lowerEnd, curBin) * curBin.ndv
    +    } else {
    +      // compute how much the [lowerEnd, higherEnd] range occupies the bins in a histogram.
    +      // Our computation has 3 parts: the smallest/min bin, the middle bins, the largest/max bin.
    +      val minCurBin = histogram.bins(lowerId)
    +      val minPartNdv = getOccupation(minCurBin.hi, lowerEnd, minCurBin) * minCurBin.ndv
    +
    +      val maxCurBin = histogram.bins(higherId)
    +      val maxPartNdv = getOccupation(higherEnd, maxCurBin.lo, maxCurBin) * maxCurBin.ndv
    +
    +      // The total ndv is minPartNdv + maxPartNdv + Ndvs between them.
    +      // In order to avoid counting same distinct value twice, we check if the upperBound value
    --- End diff --
    
    are you saying different bins may have duplicated values? cc @wzhfy 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154248457
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,197 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param histogram a numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +
    +  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
    +    var binId = 0
    --- End diff --
    
    If a user changes the data, statistics will be removed, or re-collected (only size currently), Spark already implements this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155231606
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    --- End diff --
    
    a space after `-=`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by ron8hu <gi...@git.apache.org>.

Github user ron8hu commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    cc @sameeragarwal @cloud-fan @gatorsmile @wzhfy 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155557144
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,171 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = 0
    +    while ((i < bins.length) && (value > bins(i).hi)) {
    +      i +=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var i = bins.length - 1
    +    while ((i >= 0) && (value < bins(i).lo)) {
    +      i -=1
    +    }
    +    i
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (curBin.hi == curBin.lo) {
    +      // the entire bin is covered in the range
    +      1.0
    +    } else if (higherValue == lowerValue) {
    +      // set percentage to 1/NDV
    +      1.0 / curBin.ndv.toDouble
    +    } else {
    +      // Use proration since the range falls inside this bin.
    +      math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
    +    }
    +  }
    +
    +  /**
    +   * Returns the number of bins for column values in [lowerValue, higherValue].
    --- End diff --
    
    i see. Let's explain this in the java doc.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by wzhfy <gi...@git.apache.org>.

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r153972982
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
             colStatsMap.update(attr, newStats)
           }
     
    -      Some(1.0 / BigDecimal(ndv))
    -    } else {
    +      // We compute filter selectivity using Histogram information
    +      attr.dataType match {
    +        case StringType | BinaryType =>
    +          Some(1.0 / BigDecimal(ndv))
    +
    +        case _ =>
    +          // returns 1/ndv if there is no histogram
    +          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
    +
    +          // We traverse histogram bins to locate the literal value
    +          val hgmBins = colStat.histogram.get.bins
    +          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +          // find the interval where this datum locates
    +          var lowerId, higherId = -1
    +          for (i <- hgmBins.indices) {
    +            // if datum > upperBound, just move to next bin
    +            if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
    +            if (higherId < 0) {
    +              if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) ||
    +                ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) {
    +                higherId = i
    +              }
    +            }
    +          }
    +          assert(lowerId <= higherId)
    +          val lowerBinNdv = hgmBins(lowerId).ndv
    +          val higherBinNdv = hgmBins(higherId).ndv
    +          // assume uniform distribution in each bin
    +          val percent = if (lowerId == higherId) {
    +            (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1)
    +          } else {
    +            1.0 / hgmBins.length * (higherId - lowerId - 1) +
    +              (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) +
    +              (1.0 / hgmBins.length) / math.max(higherBinNdv, 1)
    --- End diff --
    
    bin's ndv will never be less than 1, right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19783
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84573/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154848428
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +114,194 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns the number of the first bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the first bin into which a column values falls.
    +   */
    +  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    bins.foreach { bin =>
    +      if (value > bin.hi) binId += 1
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns the number of the last bin into which a column values falls for a specified
    +   * numeric equi-height histogram.
    +   *
    +   * @param value a literal value of a column
    +   * @param bins an array of bins for a given numeric equi-height histogram
    +   * @return the number of the last bin into which a column values falls.
    +   */
    +  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
    +    var binId = 0
    +    for (i <- bins.indices) {
    +      if (value > bins(i).hi) {
    +        // increment binId to point to next bin
    +        binId += 1
    +      }
    +      if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) {
    +        // We assume the above 3 conditions will be evaluated from left to right sequentially.
    +        // If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen.
    +        // At that time, we should split the third condition into another if statement.
    +        // increment binId since the value appears in this bin and next bin
    +        binId += 1
    +      }
    +    }
    +    binId
    +  }
    +
    +  /**
    +   * Returns a percentage of a bin holding values for column value in the range of
    +   * [lowerValue, higherValue]
    +   *
    +   * @param binId a given bin id in a specified histogram
    +   * @param higherValue a given upper bound value of a specified column value range
    +   * @param lowerValue a given lower bound value of a specified column value range
    +   * @param histogram a numeric equi-height histogram
    +   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   */
    +  private def getOccupation(
    +      binId: Int,
    +      higherValue: Double,
    +      lowerValue: Double,
    +      histogram: Histogram): Double = {
    +    val curBin = histogram.bins(binId)
    +    if (binId == 0 && curBin.hi == curBin.lo) {
    +      // the Min of the histogram occupies the whole first bin
    +      1.0
    +    } else if (binId == 0 && curBin.hi != curBin.lo) {
    +      if (higherValue == lowerValue) {
    +        // set percentage to 1/NDV
    +        1.0 / curBin.ndv.toDouble
    +      } else {
    +        // Use proration since the range falls inside this bin.
    +        (higherValue - lowerValue) / (curBin.hi - curBin.lo)
    --- End diff --
    
    this is the only branch we need to specialize for `binId=0`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org