You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/12/12 14:37:48 UTC
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19952
[SPARK-21322][SQL][followup] support histogram in filter cardinality estimation
## What changes were proposed in this pull request?
some code cleanup/refactor and naming improvement.
## How was this patch tested?
existing tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark minor
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19952.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19952
----
commit 3f1f3b1f70ce3f42e0d861a7b37d4e19e7a90d38
Author: Wenchen Fan <we...@databricks.com>
Date: 2017-12-12T14:36:02Z
code cleanup
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19952
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84773/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by ron8hu <gi...@git.apache.org>.
Github user ron8hu commented on a diff in the pull request:
https://github.com/apache/spark/pull/19952#discussion_r156555044
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -147,65 +139,76 @@ object EstimationUtils {
}
/**
- * Returns a percentage of a bin holding values for column value in the range of
- * [lowerValue, higherValue]
- *
- * @param higherValue a given upper bound value of a specified column value range
- * @param lowerValue a given lower bound value of a specified column value range
- * @param bin a single histogram bin
- * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+ * Returns the possibility of the given histogram bin holding values within the given range
+ * [lowerBound, upperBound].
*/
- private def getOccupation(
- higherValue: Double,
- lowerValue: Double,
+ private def binHoldingRangePossibility(
+ upperBound: Double,
+ lowerBound: Double,
bin: HistogramBin): Double = {
- assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
+ assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
if (bin.hi == bin.lo) {
// the entire bin is covered in the range
1.0
- } else if (higherValue == lowerValue) {
+ } else if (upperBound == lowerBound) {
// set percentage to 1/NDV
1.0 / bin.ndv.toDouble
} else {
// Use proration since the range falls inside this bin.
- math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
+ math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
}
}
/**
- * Returns the number of bins for column values in [lowerValue, higherValue].
- * The column value distribution is saved in an equi-height histogram. The return values is a
- * double value is because we may return a portion of a bin. For example, a predicate
- * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
+ * Returns the number of histogram bins holding values within the given range
+ * [lowerBound, upperBound].
+ *
+ * Note that the returned value is double type, because the range boundaries usually occupy a
+ * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
--- End diff --
typo: extreme
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19952
**[Test build #84773 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84773/testReport)** for PR 19952 at commit [`ebcd6d1`](https://github.com/apache/spark/commit/ebcd6d107087cf024362b3a901514cfc12c3dbab).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19952
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84827/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19952
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19952
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84784/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19952
**[Test build #84822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84822/testReport)** for PR 19952 at commit [`8fe0c49`](https://github.com/apache/spark/commit/8fe0c4991b90781a7017de4938705bbc32244dc6).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19952
**[Test build #84784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84784/testReport)** for PR 19952 at commit [`ebcd6d1`](https://github.com/apache/spark/commit/ebcd6d107087cf024362b3a901514cfc12c3dbab).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19952
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19952
**[Test build #84827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84827/testReport)** for PR 19952 at commit [`4e35c43`](https://github.com/apache/spark/commit/4e35c43957cf27b105c8f6b8ff19621aac540098).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19952#discussion_r156384127
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -147,65 +139,78 @@ object EstimationUtils {
}
/**
- * Returns a percentage of a bin holding values for column value in the range of
- * [lowerValue, higherValue]
- *
- * @param higherValue a given upper bound value of a specified column value range
- * @param lowerValue a given lower bound value of a specified column value range
- * @param bin a single histogram bin
- * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+ * Returns the possibility of the given histogram bin holding values within the given range
+ * [lowerBound, upperBound].
*/
- private def getOccupation(
- higherValue: Double,
- lowerValue: Double,
+ private def binHoldingRangePossibility(
+ upperBound: Double,
+ lowerBound: Double,
bin: HistogramBin): Double = {
- assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
+ assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
if (bin.hi == bin.lo) {
// the entire bin is covered in the range
1.0
- } else if (higherValue == lowerValue) {
+ } else if (upperBound == lowerBound) {
// set percentage to 1/NDV
1.0 / bin.ndv.toDouble
} else {
// Use proration since the range falls inside this bin.
- math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
+ math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
}
}
/**
- * Returns the number of bins for column values in [lowerValue, higherValue].
- * The column value distribution is saved in an equi-height histogram. The return values is a
- * double value is because we may return a portion of a bin. For example, a predicate
- * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
+ * Returns the number of histogram bins holding values within the given range
+ * [lowerBound, upperBound].
*
- * @param higherId id of the high end bin holding the high end value of a column range
- * @param lowerId id of the low end bin holding the low end value of a column range
- * @param higherEnd a given upper bound value of a specified column value range
- * @param lowerEnd a given lower bound value of a specified column value range
+ * Note that the return value is double type, because the range boundaries usually occupy a
+ * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
+ * `col = value`, we can get more accuracy by allowing returning portion of histogram bins.
+ *
+ * @param upperBound the highest value of the given range
+ * @param upperBoundInclusive whether the upperBound is included in the range
+ * @param lowerBound the lowest value of the given range
+ * @param lowerBoundInclusive whether the lowerBound is included in the range
--- End diff --
instead of asking the callers to pass in the upper and lower bin id, it's more intuitive to pass whether to include the range boundaries.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19952#discussion_r156551478
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -147,65 +139,78 @@ object EstimationUtils {
}
/**
- * Returns a percentage of a bin holding values for column value in the range of
- * [lowerValue, higherValue]
- *
- * @param higherValue a given upper bound value of a specified column value range
- * @param lowerValue a given lower bound value of a specified column value range
- * @param bin a single histogram bin
- * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+ * Returns the possibility of the given histogram bin holding values within the given range
+ * [lowerBound, upperBound].
*/
- private def getOccupation(
- higherValue: Double,
- lowerValue: Double,
+ private def binHoldingRangePossibility(
+ upperBound: Double,
+ lowerBound: Double,
bin: HistogramBin): Double = {
- assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
+ assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
if (bin.hi == bin.lo) {
// the entire bin is covered in the range
1.0
- } else if (higherValue == lowerValue) {
+ } else if (upperBound == lowerBound) {
// set percentage to 1/NDV
1.0 / bin.ndv.toDouble
} else {
// Use proration since the range falls inside this bin.
- math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
+ math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
}
}
/**
- * Returns the number of bins for column values in [lowerValue, higherValue].
- * The column value distribution is saved in an equi-height histogram. The return values is a
- * double value is because we may return a portion of a bin. For example, a predicate
- * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
+ * Returns the number of histogram bins holding values within the given range
+ * [lowerBound, upperBound].
*
- * @param higherId id of the high end bin holding the high end value of a column range
- * @param lowerId id of the low end bin holding the low end value of a column range
- * @param higherEnd a given upper bound value of a specified column value range
- * @param lowerEnd a given lower bound value of a specified column value range
+ * Note that the return value is double type, because the range boundaries usually occupy a
+ * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
+ * `col = value`, we can get more accuracy by allowing returning portion of histogram bins.
+ *
+ * @param upperBound the highest value of the given range
+ * @param upperBoundInclusive whether the upperBound is included in the range
+ * @param lowerBound the lowest value of the given range
+ * @param lowerBoundInclusive whether the lowerBound is included in the range
* @param histogram a numeric equi-height histogram
- * @return the number of bins for column values in [lowerEnd, higherEnd].
*/
- def getOccupationBins(
- higherId: Int,
- lowerId: Int,
- higherEnd: Double,
- lowerEnd: Double,
+ def numBinsHoldingRange(
+ upperBound: Double,
+ upperBoundInclusive: Boolean,
+ lowerBound: Double,
+ lowerBoundInclusive: Boolean,
histogram: Histogram): Double = {
--- End diff --
Is it better to pass the bin array instead of histogram? we can simplify many `histogram.bins` here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19952#discussion_r156383617
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -115,14 +115,10 @@ object EstimationUtils {
}
/**
- * Returns the number of the first bin into which a column value falls for a specified
+ * Returns the index of the first bin into which the given value falls for a specified
* numeric equi-height histogram.
- *
- * @param value a literal value of a column
- * @param bins an array of bins for a given numeric equi-height histogram
- * @return the id of the first bin into which a column value falls.
--- End diff --
Now it's a private method and we can omit the parameter doc as it's trivial.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19952
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19952
thanks, merging to master!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19952#discussion_r156549416
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -147,65 +139,78 @@ object EstimationUtils {
}
/**
- * Returns a percentage of a bin holding values for column value in the range of
- * [lowerValue, higherValue]
- *
- * @param higherValue a given upper bound value of a specified column value range
- * @param lowerValue a given lower bound value of a specified column value range
- * @param bin a single histogram bin
- * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+ * Returns the possibility of the given histogram bin holding values within the given range
+ * [lowerBound, upperBound].
*/
- private def getOccupation(
- higherValue: Double,
- lowerValue: Double,
+ private def binHoldingRangePossibility(
+ upperBound: Double,
+ lowerBound: Double,
bin: HistogramBin): Double = {
- assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
+ assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
if (bin.hi == bin.lo) {
// the entire bin is covered in the range
1.0
- } else if (higherValue == lowerValue) {
+ } else if (upperBound == lowerBound) {
// set percentage to 1/NDV
1.0 / bin.ndv.toDouble
} else {
// Use proration since the range falls inside this bin.
- math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
+ math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
}
}
/**
- * Returns the number of bins for column values in [lowerValue, higherValue].
- * The column value distribution is saved in an equi-height histogram. The return values is a
- * double value is because we may return a portion of a bin. For example, a predicate
- * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
+ * Returns the number of histogram bins holding values within the given range
+ * [lowerBound, upperBound].
*
- * @param higherId id of the high end bin holding the high end value of a column range
- * @param lowerId id of the low end bin holding the low end value of a column range
- * @param higherEnd a given upper bound value of a specified column value range
- * @param lowerEnd a given lower bound value of a specified column value range
+ * Note that the return value is double type, because the range boundaries usually occupy a
--- End diff --
nit: returned value
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19952
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84822/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19952
**[Test build #84822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84822/testReport)** for PR 19952 at commit [`8fe0c49`](https://github.com/apache/spark/commit/8fe0c4991b90781a7017de4938705bbc32244dc6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19952#discussion_r156550872
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -574,51 +539,90 @@ case class FilterEstimation(plan: Filter) extends Logging {
}
/**
- * Returns the selectivity percentage for binary condition in the column's
- * current valid range [min, max]
- *
- * @param op a binary comparison operator
- * @param histogram a numeric equi-height histogram
- * @param max the upper bound of the current valid range for a given column
- * @param min the lower bound of the current valid range for a given column
- * @param datumNumber the numeric value of a literal
- * @return the selectivity percentage for a condition in the current range.
+ * Computes the possibility of a equal predicate using histogram.
--- End diff --
nit: an equality predicate
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19952
cc @ron8hu @wzhfy
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19952
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19952
**[Test build #84827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84827/testReport)** for PR 19952 at commit [`4e35c43`](https://github.com/apache/spark/commit/4e35c43957cf27b105c8f6b8ff19621aac540098).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19952
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19952#discussion_r156549603
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -147,65 +139,78 @@ object EstimationUtils {
}
/**
- * Returns a percentage of a bin holding values for column value in the range of
- * [lowerValue, higherValue]
- *
- * @param higherValue a given upper bound value of a specified column value range
- * @param lowerValue a given lower bound value of a specified column value range
- * @param bin a single histogram bin
- * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+ * Returns the possibility of the given histogram bin holding values within the given range
+ * [lowerBound, upperBound].
*/
- private def getOccupation(
- higherValue: Double,
- lowerValue: Double,
+ private def binHoldingRangePossibility(
+ upperBound: Double,
+ lowerBound: Double,
bin: HistogramBin): Double = {
- assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
+ assert(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
if (bin.hi == bin.lo) {
// the entire bin is covered in the range
1.0
- } else if (higherValue == lowerValue) {
+ } else if (upperBound == lowerBound) {
// set percentage to 1/NDV
1.0 / bin.ndv.toDouble
} else {
// Use proration since the range falls inside this bin.
- math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
+ math.min((upperBound - lowerBound) / (bin.hi - bin.lo), 1.0)
}
}
/**
- * Returns the number of bins for column values in [lowerValue, higherValue].
- * The column value distribution is saved in an equi-height histogram. The return values is a
- * double value is because we may return a portion of a bin. For example, a predicate
- * "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values.
+ * Returns the number of histogram bins holding values within the given range
+ * [lowerBound, upperBound].
*
- * @param higherId id of the high end bin holding the high end value of a column range
- * @param lowerId id of the low end bin holding the low end value of a column range
- * @param higherEnd a given upper bound value of a specified column value range
- * @param lowerEnd a given lower bound value of a specified column value range
+ * Note that the return value is double type, because the range boundaries usually occupy a
+ * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
+ * `col = value`, we can get more accuracy by allowing returning portion of histogram bins.
--- End diff --
nit: get higher accuracy
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19952
**[Test build #84773 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84773/testReport)** for PR 19952 at commit [`ebcd6d1`](https://github.com/apache/spark/commit/ebcd6d107087cf024362b3a901514cfc12c3dbab).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19952: [SPARK-21322][SQL][followup] support histogram in...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19952#discussion_r156552208
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -574,51 +539,90 @@ case class FilterEstimation(plan: Filter) extends Logging {
}
/**
- * Returns the selectivity percentage for binary condition in the column's
- * current valid range [min, max]
- *
- * @param op a binary comparison operator
- * @param histogram a numeric equi-height histogram
- * @param max the upper bound of the current valid range for a given column
- * @param min the lower bound of the current valid range for a given column
- * @param datumNumber the numeric value of a literal
- * @return the selectivity percentage for a condition in the current range.
+ * Computes the possibility of a equal predicate using histogram.
*/
+ private def computeEqualityPossibilityByHistogram(
+ literal: Literal, colStat: ColumnStat): Double = {
+ val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+ val histogram = colStat.histogram.get
- def computePercentByEquiHeightHgm(
- op: BinaryComparison,
- histogram: Histogram,
- max: Double,
- min: Double,
- datumNumber: Double): Double = {
// find bins where column's current min and max locate. Note that a column's [min, max]
// range may change due to another condition applied earlier.
- val minBinId = EstimationUtils.findFirstBinForValue(min, histogram.bins)
- val maxBinId = EstimationUtils.findLastBinForValue(max, histogram.bins)
+ val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
+ val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
// compute how many bins the column's current valid range [min, max] occupies.
- // Note that a column's [min, max] range may vary after we apply some filter conditions.
- val minToMaxLength = EstimationUtils.getOccupationBins(maxBinId, minBinId, max, min, histogram)
-
- val datumInBinId = op match {
- case LessThan(_, _) | GreaterThanOrEqual(_, _) =>
- EstimationUtils.findFirstBinForValue(datumNumber, histogram.bins)
- case LessThanOrEqual(_, _) | GreaterThan(_, _) =>
- EstimationUtils.findLastBinForValue(datumNumber, histogram.bins)
- }
+ val numBinsHoldingEntireRange = EstimationUtils.numBinsHoldingRange(
+ upperBound = max,
+ upperBoundInclusive = true,
+ lowerBound = min,
+ lowerBoundInclusive = true,
+ histogram)
+
+ val numBinsHoldingDatum = EstimationUtils.numBinsHoldingRange(
+ upperBound = datum,
+ upperBoundInclusive = true,
+ lowerBound = datum,
+ lowerBoundInclusive = true,
+ histogram)
+
+ numBinsHoldingDatum / numBinsHoldingEntireRange
+ }
- op match {
- // LessThan and LessThanOrEqual share the same logic,
- // but their datumInBinId may be different
- case LessThan(_, _) | LessThanOrEqual(_, _) =>
- EstimationUtils.getOccupationBins(datumInBinId, minBinId, datumNumber, min,
- histogram) / minToMaxLength
- // GreaterThan and GreaterThanOrEqual share the same logic,
- // but their datumInBinId may be different
- case GreaterThan(_, _) | GreaterThanOrEqual(_, _) =>
- EstimationUtils.getOccupationBins(maxBinId, datumInBinId, max, datumNumber,
- histogram) / minToMaxLength
+ /**
+ * Computes the possibility of a comparison predicate using histogram.
+ */
+ private def computeComparisonPossibilityByHistogram(
+ op: BinaryComparison, literal: Literal, colStat: ColumnStat): Double = {
+ val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+ val histogram = colStat.histogram.get
+
+ // find bins where column's current min and max locate. Note that a column's [min, max]
+ // range may change due to another condition applied earlier.
+ val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
+ val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
+
+ // compute how many bins the column's current valid range [min, max] occupies.
+ val numBinsHoldingEntireRange = EstimationUtils.numBinsHoldingRange(
+ max, upperBoundInclusive = true, min, lowerBoundInclusive = true, histogram)
+
+ val numBinsHoldingDatum = op match {
--- End diff --
`numBinsHoldingRange`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19952
**[Test build #84784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84784/testReport)** for PR 19952 at commit [`ebcd6d1`](https://github.com/apache/spark/commit/ebcd6d107087cf024362b3a901514cfc12c3dbab).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19952: [SPARK-21322][SQL][followup] support histogram in filter...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19952
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org