Posted to issues@spark.apache.org by "Zhenhua Wang (JIRA)" <ji...@apache.org> on 2016/10/19 02:53:58 UTC

[jira] [Created] (SPARK-18000) Aggregation function for computing endpoints for numeric histograms

Zhenhua Wang created SPARK-18000:
------------------------------------

             Summary: Aggregation function for computing endpoints for numeric histograms
                 Key: SPARK-18000
                 URL: https://issues.apache.org/jira/browse/SPARK-18000
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Zhenhua Wang


For a column of numeric type (including date and timestamp), we will generate an equi-width or equi-height histogram, depending on whether its number of distinct values (ndv) is larger than the maximum number of bins allowed in one histogram (denoted as numBins).
This aggregate function computes values and their frequencies using a small hashmap whose size is less than or equal to "numBins", and returns an equi-width histogram as long as the ndv stays within that limit.
When the size of the hashmap exceeds "numBins", it clears the hashmap and uses ApproximatePercentile to return the endpoints of an equi-height histogram instead.
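
The two-phase behaviour described above can be illustrated with a rough Python sketch. This is not the proposed Spark implementation: the function names histogram_endpoints and equi_height_endpoints are hypothetical, and an exact sort stands in for ApproximatePercentile, which in Spark would compute approximate percentiles without holding all values in memory.

```python
import math


def equi_height_endpoints(sorted_vals, num_bins):
    """Return num_bins + 1 endpoints so each bin holds roughly the same
    number of rows. Exact percentiles are used here as a stand-in for
    ApproximatePercentile (an assumption made for illustration only)."""
    n = len(sorted_vals)
    endpoints = [sorted_vals[0]]  # the minimum is the first endpoint
    for i in range(1, num_bins + 1):
        # Index of the (i / num_bins) percentile in the sorted values.
        idx = max(math.ceil(i * n / num_bins) - 1, 0)
        endpoints.append(sorted_vals[idx])
    return endpoints


def histogram_endpoints(values, num_bins):
    """Hypothetical sketch of the aggregate described in the issue.

    Returns ("equi-width", {value: frequency}) while the number of
    distinct values stays within num_bins, otherwise
    ("equi-height", [endpoints])."""
    non_null = [v for v in values if v is not None]
    counts = {}
    overflow = False
    for v in non_null:
        counts[v] = counts.get(v, 0) + 1
        if len(counts) > num_bins:
            # Too many distinct values: drop the hashmap and fall back
            # to the percentile-based path.
            counts.clear()
            overflow = True
            break
    if not overflow:
        return ("equi-width", counts)
    return ("equi-height", equi_height_endpoints(sorted(non_null), num_bins))


if __name__ == "__main__":
    print(histogram_endpoints([1, 1, 2, 3, 3, 3], num_bins=3))
    print(histogram_endpoints(list(range(1, 11)), num_bins=5))
```

The first call has three distinct values and stays on the hashmap path; the second has ten distinct values for five bins and therefore switches to equi-height endpoints.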



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
