Posted to issues@spark.apache.org by "Zhenhua Wang (JIRA)" <ji...@apache.org> on 2016/09/30 22:29:20 UTC

[jira] [Commented] (SPARK-17074) generate histogram information for column

    [ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537248#comment-15537248 ] 

Zhenhua Wang commented on SPARK-17074:
--------------------------------------

Hi, there's something I want to discuss here. To generate equi-height histograms, we need the NDV (number of distinct values) for each bin in the histogram; this information is important for estimation.
I think we have two ways to get it:
1. Use percentile_approx to get the percentiles (the equi-height bin boundaries), then use a new aggregate function to count the NDV in each of these intervals. This takes two table scans.
2. Modify QuantileSummaries so that it counts distinct values while it computes percentiles. This takes only one table scan, but I'm not sure about the accuracy of the NDV results.
So there's a performance vs. accuracy trade-off here. I lean toward the second method. What do you think? [~rxin] [~hvanhovell] [~vssrinath] [~thunterdb] [~ron8hu]
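
To make the first option concrete, here is a minimal two-pass sketch in plain Python. It is only an illustration under stated assumptions: the function names equi_height_endpoints and ndv_per_bin are hypothetical, the first pass sorts the data to get exact quantiles (a stand-in for percentile_approx, which is approximate), and the second pass uses exact sets rather than a sketch-based distinct counter.

```python
from bisect import bisect_left


def equi_height_endpoints(values, num_bins):
    """Pass 1 (hypothetical helper): exact quantile endpoints.

    Stands in for percentile_approx; a real implementation would
    use an approximate summary instead of sorting everything.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Endpoint i is the value at rank i * n / num_bins; the last endpoint is the max.
    lower = [ordered[min(n - 1, (i * n) // num_bins)] for i in range(num_bins)]
    return lower + [ordered[-1]]


def ndv_per_bin(values, endpoints):
    """Pass 2 (hypothetical helper): exact distinct count per bin.

    Bin i covers (endpoints[i], endpoints[i + 1]]; the first bin
    also includes its lower endpoint (the column minimum).
    """
    num_bins = len(endpoints) - 1
    seen = [set() for _ in range(num_bins)]
    for v in values:
        # Find the first upper endpoint that is >= v.
        idx = bisect_left(endpoints, v, 1, num_bins) - 1
        seen[idx].add(v)
    return [len(s) for s in seen]


data = [1, 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10]
endpoints = equi_height_endpoints(data, 4)
print(endpoints)                     # [1, 3, 6, 8, 10]
print(ndv_per_bin(data, endpoints))  # [3, 3, 2, 2]
```

The two loops over the data correspond to the two table scans mentioned above; the second option would fold the distinct counting into the same pass that builds the quantile summary.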


> generate histogram information for column
> -----------------------------------------
>
>                 Key: SPARK-17074
>                 URL: https://issues.apache.org/jira/browse/SPARK-17074
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer
>    Affects Versions: 2.0.0
>            Reporter: Ron Hu
>
> We support two kinds of histograms:
> - Equi-width histogram: Each column interval in the histogram has a fixed width. The height of a bin represents the frequency of the column values in that interval, so the height varies across intervals. We use the equi-width histogram when the number of distinct values is less than 254.
> - Equi-height histogram: The width of each column interval varies, while the heights of all column intervals are the same. The equi-height histogram is effective in handling skewed data distributions. We use the equi-height histogram when the number of distinct values is equal to or greater than 254.
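
The selection rule in the description can be sketched as follows. This is a plain-Python illustration, not Spark code: the names NDV_THRESHOLD, histogram_kind and equi_width_histogram are hypothetical, and only the threshold of 254 comes from the description above.

```python
NDV_THRESHOLD = 254  # threshold taken from the issue description


def histogram_kind(values):
    """Pick the histogram type from the number of distinct values."""
    ndv = len(set(values))
    return "equi-width" if ndv < NDV_THRESHOLD else "equi-height"


def equi_width_histogram(values, num_bins):
    """Count values in num_bins intervals of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            idx = 0  # all values are equal, so everything lands in the first bin
        else:
            # Clamp so that the maximum value falls in the last bin.
            idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


print(histogram_kind(range(10)))                    # equi-width
print(histogram_kind(range(254)))                   # equi-height
print(equi_width_histogram(list(range(11)), 5))     # [2, 2, 2, 2, 3]
```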


