You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/12/12 14:37:48 UTC

[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19952

    [SPARK-21322][SQL][followup] support histogram in filter cardinality estimation

    ## What changes were proposed in this pull request?
    
    some code cleanup/refactor and naming improvement.
    
    ## How was this patch tested?
    
    existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19952.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19952
    
----
commit 3f1f3b1f70ce3f42e0d861a7b37d4e19e7a90d38
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-12-12T14:36:02Z

    code cleanup

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84773/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by ron8hu <gi...@git.apache.org>.
Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19952#discussion_r156555044
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -147,65 +139,76 @@ object EstimationUtils {
       }
     
       /**
    -   * Returns a percentage of a bin holding values for column value in the range of
    -   * [lowerValue, higherValue]
    -   *
    -   * @param higherValue a given upper bound value of a specified column value range
    -   * @param lowerValue a given lower bound value of a specified column value range
    -   * @param bin a single histogram bin
    -   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   * Returns the possibility of the given histogram bin holding values within the given range
    +   * [lowerBound, upperBound].
        */
    -  private def getOccupation(
    -      higherValue: Double,
    -      lowerValue: Double,
    +  private def binHoldingRangePossibility(
    +      upperBound: Double,
    +      lowerBound: Double,
           bin: HistogramBin): Double = {
    -    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
         if (bin.hi == bin.lo) {
           // the entire bin is covered in the range
           1.0
    -    } else if (higherValue == lowerValue) {
    +    } else if (upperBound == lowerBound) {
           // set percentage to 1/NDV
           1.0 / bin.ndv.toDouble
         } else {
           // Use proration since the range falls inside this bin.
    -      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +      math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
         }
       }
     
       /**
    -   * Returns the number of bins for column values in [lowerValue, higherValue].
    -   * The column value distribution is saved in an equi-height histogram.  The return values is a
    -   * double value is because we may return a portion of a bin. For example, a predicate
    -   * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
    +   * Returns the number of histogram bins holding values within the given range
    +   * [lowerBound, upperBound].
    +   *
    +   * Note that the returned value is double type, because the range boundaries usually occupy a
    +   * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
    --- End diff --
    
    typo: extreme


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    **[Test build #84773 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84773/testReport)** for PR 19952 at commit [`ebcd6d1`](https://github.com/apache/spark/commit/ebcd6d107087cf024362b3a901514cfc12c3dbab).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84827/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84784/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    **[Test build #84822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84822/testReport)** for PR 19952 at commit [`8fe0c49`](https://github.com/apache/spark/commit/8fe0c4991b90781a7017de4938705bbc32244dc6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    **[Test build #84784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84784/testReport)** for PR 19952 at commit [`ebcd6d1`](https://github.com/apache/spark/commit/ebcd6d107087cf024362b3a901514cfc12c3dbab).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    **[Test build #84827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84827/testReport)** for PR 19952 at commit [`4e35c43`](https://github.com/apache/spark/commit/4e35c43957cf27b105c8f6b8ff19621aac540098).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19952#discussion_r156384127
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -147,65 +139,78 @@ object EstimationUtils {
       }
     
       /**
    -   * Returns a percentage of a bin holding values for column value in the range of
    -   * [lowerValue, higherValue]
    -   *
    -   * @param higherValue a given upper bound value of a specified column value range
    -   * @param lowerValue a given lower bound value of a specified column value range
    -   * @param bin a single histogram bin
    -   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   * Returns the possibility of the given histogram bin holding values within the given range
    +   * [lowerBound, upperBound].
        */
    -  private def getOccupation(
    -      higherValue: Double,
    -      lowerValue: Double,
    +  private def binHoldingRangePossibility(
    +      upperBound: Double,
    +      lowerBound: Double,
           bin: HistogramBin): Double = {
    -    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
         if (bin.hi == bin.lo) {
           // the entire bin is covered in the range
           1.0
    -    } else if (higherValue == lowerValue) {
    +    } else if (upperBound == lowerBound) {
           // set percentage to 1/NDV
           1.0 / bin.ndv.toDouble
         } else {
           // Use proration since the range falls inside this bin.
    -      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +      math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
         }
       }
     
       /**
    -   * Returns the number of bins for column values in [lowerValue, higherValue].
    -   * The column value distribution is saved in an equi-height histogram.  The return values is a
    -   * double value is because we may return a portion of a bin. For example, a predicate
    -   * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
    +   * Returns the number of histogram bins holding values within the given range
    +   * [lowerBound, upperBound].
        *
    -   * @param higherId id of the high end bin holding the high end value of a column range
    -   * @param lowerId id of the low end bin holding the low end value of a column range
    -   * @param higherEnd a given upper bound value of a specified column value range
    -   * @param lowerEnd a given lower bound value of a specified column value range
    +   * Note that the return value is double type, because the range boundaries usually occupy a
    +   * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
    +   * `col = value`, we can get more accuracy by allowing returning portion of histogram bins.
    +   *
    +   * @param upperBound the highest value of the given range
    +   * @param upperBoundInclusive whether the upperBound is included in the range
    +   * @param lowerBound the lowest value of the given range
    +   * @param lowerBoundInclusive whether the lowerBound is included in the range
    --- End diff --
    
    instead of asking the callers to pass in the upper and lower bin id, it's more intuitive to pass whether to include the range boundaries.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19952#discussion_r156551478
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -147,65 +139,78 @@ object EstimationUtils {
       }
     
       /**
    -   * Returns a percentage of a bin holding values for column value in the range of
    -   * [lowerValue, higherValue]
    -   *
    -   * @param higherValue a given upper bound value of a specified column value range
    -   * @param lowerValue a given lower bound value of a specified column value range
    -   * @param bin a single histogram bin
    -   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   * Returns the possibility of the given histogram bin holding values within the given range
    +   * [lowerBound, upperBound].
        */
    -  private def getOccupation(
    -      higherValue: Double,
    -      lowerValue: Double,
    +  private def binHoldingRangePossibility(
    +      upperBound: Double,
    +      lowerBound: Double,
           bin: HistogramBin): Double = {
    -    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
         if (bin.hi == bin.lo) {
           // the entire bin is covered in the range
           1.0
    -    } else if (higherValue == lowerValue) {
    +    } else if (upperBound == lowerBound) {
           // set percentage to 1/NDV
           1.0 / bin.ndv.toDouble
         } else {
           // Use proration since the range falls inside this bin.
    -      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +      math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
         }
       }
     
       /**
    -   * Returns the number of bins for column values in [lowerValue, higherValue].
    -   * The column value distribution is saved in an equi-height histogram.  The return values is a
    -   * double value is because we may return a portion of a bin. For example, a predicate
    -   * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
    +   * Returns the number of histogram bins holding values within the given range
    +   * [lowerBound, upperBound].
        *
    -   * @param higherId id of the high end bin holding the high end value of a column range
    -   * @param lowerId id of the low end bin holding the low end value of a column range
    -   * @param higherEnd a given upper bound value of a specified column value range
    -   * @param lowerEnd a given lower bound value of a specified column value range
    +   * Note that the return value is double type, because the range boundaries usually occupy a
    +   * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
    +   * `col = value`, we can get more accuracy by allowing returning portion of histogram bins.
    +   *
    +   * @param upperBound the highest value of the given range
    +   * @param upperBoundInclusive whether the upperBound is included in the range
    +   * @param lowerBound the lowest value of the given range
    +   * @param lowerBoundInclusive whether the lowerBound is included in the range
        * @param histogram a numeric equi-height histogram
    -   * @return the number of bins for column values in [lowerEnd, higherEnd].
        */
    -  def getOccupationBins(
    -      higherId: Int,
    -      lowerId: Int,
    -      higherEnd: Double,
    -      lowerEnd: Double,
    +  def numBinsHoldingRange(
    +      upperBound: Double,
    +      upperBoundInclusive: Boolean,
    +      lowerBound: Double,
    +      lowerBoundInclusive: Boolean,
           histogram: Histogram): Double = {
    --- End diff --
    
    Is it better to pass the bin array instead of histogram? we can simplify many `histogram.bins` here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19952#discussion_r156383617
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -115,14 +115,10 @@ object EstimationUtils {
       }
     
       /**
    -   * Returns the number of the first bin into which a column value falls for a specified
    +   * Returns the index of the first bin into which the given value falls for a specified
        * numeric equi-height histogram.
    -   *
    -   * @param value a literal value of a column
    -   * @param bins an array of bins for a given numeric equi-height histogram
    -   * @return the id of the first bin into which a column value falls.
    --- End diff --
    
    Now it's a private method and we can omit the parameter doc as it's trivial.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19952#discussion_r156549416
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -147,65 +139,78 @@ object EstimationUtils {
       }
     
       /**
    -   * Returns a percentage of a bin holding values for column value in the range of
    -   * [lowerValue, higherValue]
    -   *
    -   * @param higherValue a given upper bound value of a specified column value range
    -   * @param lowerValue a given lower bound value of a specified column value range
    -   * @param bin a single histogram bin
    -   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   * Returns the possibility of the given histogram bin holding values within the given range
    +   * [lowerBound, upperBound].
        */
    -  private def getOccupation(
    -      higherValue: Double,
    -      lowerValue: Double,
    +  private def binHoldingRangePossibility(
    +      upperBound: Double,
    +      lowerBound: Double,
           bin: HistogramBin): Double = {
    -    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
         if (bin.hi == bin.lo) {
           // the entire bin is covered in the range
           1.0
    -    } else if (higherValue == lowerValue) {
    +    } else if (upperBound == lowerBound) {
           // set percentage to 1/NDV
           1.0 / bin.ndv.toDouble
         } else {
           // Use proration since the range falls inside this bin.
    -      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +      math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
         }
       }
     
       /**
    -   * Returns the number of bins for column values in [lowerValue, higherValue].
    -   * The column value distribution is saved in an equi-height histogram.  The return values is a
    -   * double value is because we may return a portion of a bin. For example, a predicate
    -   * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
    +   * Returns the number of histogram bins holding values within the given range
    +   * [lowerBound, upperBound].
        *
    -   * @param higherId id of the high end bin holding the high end value of a column range
    -   * @param lowerId id of the low end bin holding the low end value of a column range
    -   * @param higherEnd a given upper bound value of a specified column value range
    -   * @param lowerEnd a given lower bound value of a specified column value range
    +   * Note that the return value is double type, because the range boundaries usually occupy a
    --- End diff --
    
    nit: returned value


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84822/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    **[Test build #84822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84822/testReport)** for PR 19952 at commit [`8fe0c49`](https://github.com/apache/spark/commit/8fe0c4991b90781a7017de4938705bbc32244dc6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19952#discussion_r156550872
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -574,51 +539,90 @@ case class FilterEstimation(plan: Filter) extends Logging {
       }
     
       /**
    -   * Returns the selectivity percentage for binary condition in the column's
    -   * current valid range [min, max]
    -   *
    -   * @param op a binary comparison operator
    -   * @param histogram a numeric equi-height histogram
    -   * @param max the upper bound of the current valid range for a given column
    -   * @param min the lower bound of the current valid range for a given column
    -   * @param datumNumber the numeric value of a literal
    -   * @return the selectivity percentage for a condition in the current range.
    +   * Computes the possibility of a equal predicate using histogram.
    --- End diff --
    
    nit: an equality predicate


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    cc @ron8hu @wzhfy 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    **[Test build #84827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84827/testReport)** for PR 19952 at commit [`4e35c43`](https://github.com/apache/spark/commit/4e35c43957cf27b105c8f6b8ff19621aac540098).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19952


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19952#discussion_r156549603
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -147,65 +139,78 @@ object EstimationUtils {
       }
     
       /**
    -   * Returns a percentage of a bin holding values for column value in the range of
    -   * [lowerValue, higherValue]
    -   *
    -   * @param higherValue a given upper bound value of a specified column value range
    -   * @param lowerValue a given lower bound value of a specified column value range
    -   * @param bin a single histogram bin
    -   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
    +   * Returns the possibility of the given histogram bin holding values within the given range
    +   * [lowerBound, upperBound].
        */
    -  private def getOccupation(
    -      higherValue: Double,
    -      lowerValue: Double,
    +  private def binHoldingRangePossibility(
    +      upperBound: Double,
    +      lowerBound: Double,
           bin: HistogramBin): Double = {
    -    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
    +    assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
         if (bin.hi == bin.lo) {
           // the entire bin is covered in the range
           1.0
    -    } else if (higherValue == lowerValue) {
    +    } else if (upperBound == lowerBound) {
           // set percentage to 1/NDV
           1.0 / bin.ndv.toDouble
         } else {
           // Use proration since the range falls inside this bin.
    -      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
    +      math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
         }
       }
     
       /**
    -   * Returns the number of bins for column values in [lowerValue, higherValue].
    -   * The column value distribution is saved in an equi-height histogram.  The return values is a
    -   * double value is because we may return a portion of a bin. For example, a predicate
    -   * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
    +   * Returns the number of histogram bins holding values within the given range
    +   * [lowerBound, upperBound].
        *
    -   * @param higherId id of the high end bin holding the high end value of a column range
    -   * @param lowerId id of the low end bin holding the low end value of a column range
    -   * @param higherEnd a given upper bound value of a specified column value range
    -   * @param lowerEnd a given lower bound value of a specified column value range
    +   * Note that the return value is double type, because the range boundaries usually occupy a
    +   * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
    +   * `col = value`, we can get more accuracy by allowing returning portion of histogram bins.
    --- End diff --
    
    nit: get higher accuracy


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    **[Test build #84773 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84773/testReport)** for PR 19952 at commit [`ebcd6d1`](https://github.com/apache/spark/commit/ebcd6d107087cf024362b3a901514cfc12c3dbab).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19952#discussion_r156552208
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -574,51 +539,90 @@ case class FilterEstimation(plan: Filter) extends Logging {
       }
     
       /**
    -   * Returns the selectivity percentage for binary condition in the column's
    -   * current valid range [min, max]
    -   *
    -   * @param op a binary comparison operator
    -   * @param histogram a numeric equi-height histogram
    -   * @param max the upper bound of the current valid range for a given column
    -   * @param min the lower bound of the current valid range for a given column
    -   * @param datumNumber the numeric value of a literal
    -   * @return the selectivity percentage for a condition in the current range.
    +   * Computes the possibility of a equal predicate using histogram.
        */
    +  private def computeEqualityPossibilityByHistogram(
    +      literal: Literal, colStat: ColumnStat): Double = {
    +    val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +    val histogram = colStat.histogram.get
     
    -  def computePercentByEquiHeightHgm(
    -      op: BinaryComparison,
    -      histogram: Histogram,
    -      max: Double,
    -      min: Double,
    -      datumNumber: Double): Double = {
         // find bins where column's current min and max locate.  Note that a column's [min, max]
         // range may change due to another condition applied earlier.
    -    val minBinId = EstimationUtils.findFirstBinForValue(min, histogram.bins)
    -    val maxBinId = EstimationUtils.findLastBinForValue(max, histogram.bins)
    +    val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
    +    val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
     
         // compute how many bins the column's current valid range [min, max] occupies.
    -    // Note that a column's [min, max] range may vary after we apply some filter conditions.
    -    val minToMaxLength = EstimationUtils.getOccupationBins(maxBinId, minBinId, max, min, histogram)
    -
    -    val datumInBinId = op match {
    -      case LessThan(_, _) | GreaterThanOrEqual(_, _) =>
    -        EstimationUtils.findFirstBinForValue(datumNumber, histogram.bins)
    -      case LessThanOrEqual(_, _) | GreaterThan(_, _) =>
    -        EstimationUtils.findLastBinForValue(datumNumber, histogram.bins)
    -    }
    +    val numBinsHoldingEntireRange = EstimationUtils.numBinsHoldingRange(
    +      upperBound = max,
    +      upperBoundInclusive = true,
    +      lowerBound = min,
    +      lowerBoundInclusive = true,
    +      histogram)
    +
    +    val numBinsHoldingDatum = EstimationUtils.numBinsHoldingRange(
    +      upperBound = datum,
    +      upperBoundInclusive = true,
    +      lowerBound = datum,
    +      lowerBoundInclusive = true,
    +      histogram)
    +
    +    numBinsHoldingDatum / numBinsHoldingEntireRange
    +  }
     
    -    op match {
    -      // LessThan and LessThanOrEqual share the same logic,
    -      // but their datumInBinId may be different
    -      case LessThan(_, _) | LessThanOrEqual(_, _) =>
    -        EstimationUtils.getOccupationBins(datumInBinId, minBinId, datumNumber, min,
    -          histogram) / minToMaxLength
    -      // GreaterThan and GreaterThanOrEqual share the same logic,
    -      // but their datumInBinId may be different
    -      case GreaterThan(_, _) | GreaterThanOrEqual(_, _) =>
    -        EstimationUtils.getOccupationBins(maxBinId, datumInBinId, max, datumNumber,
    -          histogram) / minToMaxLength
    +  /**
    +   * Computes the possibility of a comparison predicate using histogram.
    +   */
    +  private def computeComparisonPossibilityByHistogram(
    +      op: BinaryComparison, literal: Literal, colStat: ColumnStat): Double = {
    +    val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
    +    val histogram = colStat.histogram.get
    +
    +    // find bins where column's current min and max locate.  Note that a column's [min, max]
    +    // range may change due to another condition applied earlier.
    +    val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
    +    val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
    +
    +    // compute how many bins the column's current valid range [min, max] occupies.
    +    val numBinsHoldingEntireRange = EstimationUtils.numBinsHoldingRange(
    +      max, upperBoundInclusive = true, min, lowerBoundInclusive = true, histogram)
    +
    +    val numBinsHoldingDatum = op match {
    --- End diff --
    
    `numBinsHoldingRange`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    **[Test build #84784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84784/testReport)** for PR 19952 at commit [`ebcd6d1`](https://github.com/apache/spark/commit/ebcd6d107087cf024362b3a901514cfc12c3dbab).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19952
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org