Posted to issues@spark.apache.org by "Zhenhua Wang (JIRA)" <ji...@apache.org> on 2016/09/30 22:29:20 UTC
[jira] [Commented] (SPARK-17074) generate histogram information for column
[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537248#comment-15537248 ]
Zhenhua Wang commented on SPARK-17074:
--------------------------------------
Hi, there's something I want to discuss here. To generate equi-height histograms, we need the ndv (number of distinct values) for each bin in the histogram; this information is important for estimation.
I think we have two ways to get it:
1. Use percentile_approx to get the percentiles (the equi-height bin boundaries), then use a new aggregate function to count the ndv in each of these intervals. This takes two table scans.
2. Modify QuantileSummaries so that it counts distinct values while it computes percentiles. This takes only one table scan, but I'm not sure about the accuracy of the ndv results.
So there's a performance vs. accuracy trade-off here. I lean toward the second method. What do you think? [~rxin] [~hvanhovell] [~vssrinath] [~thunterdb] [~ron8hu]
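
To make the first option concrete, here is a minimal sketch of the two-pass idea in plain Python. It is not Spark code: an exact sort stands in for percentile_approx, the per-bin sets stand in for the proposed new aggregate function, and the names equi_height_boundaries and ndv_per_bin are hypothetical.

```python
from bisect import bisect_left


def equi_height_boundaries(values, num_bins):
    """Pass 1: return the upper bound of each equi-height bin.

    Assumption: an exact sort is used here in place of percentile_approx.
    """
    ordered = sorted(values)
    n = len(ordered)
    uppers = []
    for i in range(num_bins):
        # Rank of the (i + 1) / num_bins quantile, rounded up (1-based).
        rank = ((i + 1) * n + num_bins - 1) // num_bins
        uppers.append(ordered[rank - 1])
    return uppers


def ndv_per_bin(values, uppers):
    """Pass 2: count the distinct values that fall into each bin.

    A value v belongs to the first bin whose upper bound is >= v.
    """
    distinct = [set() for _ in uppers]
    for v in values:
        distinct[bisect_left(uppers, v)].add(v)
    return [len(s) for s in distinct]


values = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9]
uppers = equi_height_boundaries(values, 4)
print(uppers, ndv_per_bin(values, uppers))
```

The second option would fold both passes into one by tracking distinct values inside the quantile sketch itself, which is where the accuracy question comes from: the bin boundaries are not known while the data is being scanned.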
> generate histogram information for column
> -----------------------------------------
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
> Issue Type: Sub-task
> Components: Optimizer
> Affects Versions: 2.0.0
> Reporter: Ron Hu
>
> We support two kinds of histograms:
> - Equi-width histogram: We have a fixed width for each column interval in the histogram. The height of a histogram represents the frequency for those column values in a specific interval. For this kind of histogram, its height varies for different column intervals. We use the equi-width histogram when the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of each column interval varies. The heights of all column intervals are the same. The equi-height histogram is effective in handling skewed data distributions. We use the equi-height histogram when the number of distinct values is equal to or greater than 254.
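
As an illustration of the description above, the following plain-Python sketch shows the selection rule and a simple equi-width histogram. It is not the Spark implementation; the function names are hypothetical, and only the threshold of 254 comes from the issue text.

```python
def choose_histogram_kind(ndv, threshold=254):
    """Pick the histogram kind from the number of distinct values."""
    return "equi-width" if ndv < threshold else "equi-height"


def equi_width_histogram(values, num_bins):
    """Count rows per fixed-width interval between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            idx = 0  # all values are identical
        else:
            # The maximum value is clamped into the last bin.
            idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


# Skewed data: most rows land in the first fixed-width interval.
skewed = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equi_width_histogram(skewed, 4))
print(choose_histogram_kind(len(set(skewed))))
```

With skewed input like this, the bin heights differ widely, which is the case the equi-height histogram is meant to handle.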