You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2019/05/01 16:57:00 UTC
[jira] [Updated] (DRILL-7228) Histogram end points show high
deviation for a sample data set
[ https://issues.apache.org/jira/browse/DRILL-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aman Sinha updated DRILL-7228:
------------------------------
Labels: ready-to-commit (was: )
Reviewer: Gautam Parai
> Histogram end points show high deviation for a sample data set
> --------------------------------------------------------------
>
> Key: DRILL-7228
> URL: https://issues.apache.org/jira/browse/DRILL-7228
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.16.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> There are couple of scenarios where the histogram bucket end points show high deviation for the attached sample data set.
> +Scenario 1: +
> There are total 100 rows in the sample. Here are the first 10 values of the c_float column ordered by the column.
> {noformat}
> select c_float from `table_stats/alltypes_with_nulls` order by c_float;
> +--------------+
> | c_float |
> +--------------+
> | -4.6873795E9 |
> | 8.1855632E7 |
> | 2.65311632E8 |
> | 4.50677952E8 |
> | 4.6864464E8 |
> | 5.7848493E8 |
> | 6.6793114E8 |
> | 7.1175571E8 |
> | 9.0065581E8 |
> | 9.2245773E8 |
> ...
> ...
> <100 rows>
> {noformat}
> Here the minimum value is a small negative number. Here's the output of the histogram after running ANALYZE command:
> {noformat}
> "buckets" : [ 8.1855488E7, 9.13736816E8, 1.7208630111999998E9, 3.2401755232E9, 4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 8.691596143999998E9, 9.983783792E9 ]
> {noformat}
> Note that the starting end point of bucket 0 is actually the 2nd value in the ordered list and the small negative number is not represented in the histogram at all.
> +Scenario 2:+
> Histogram for the c_bigint column is as below:
> {noformat}
> {
> "column" : "`c_bigint`",
> "majortype" : {
> "type" : "BIGINT",
> "mode" : "OPTIONAL"
> },
> "schema" : 1.0,
> "rowcount" : 100.0,
> "nonnullrowcount" : 87.0,
> "ndv" : 46,
> "avgwidth" : 8.0,
> "histogram" : {
> "category" : "numeric-equi-depth",
> "numRowsPerBucket" : 8,
> "buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, -5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 8.848383132345303E17, 4.6441480083157811E18 ]
> }
> }
> {noformat}
> This indicates that there are duplicate rows with the value close to 2.83 which is not true when we analyze the source data.
> This is the output of the ntile function:
> {noformat}
> SELECT bucket_num,
> min(c_bigint) as min_amount,
> max(c_bigint) as max_amount,
> count(*) as total_count
> FROM (
> SELECT c_bigint,
> NTILE(10) OVER (ORDER BY c_bigint) as bucket_num
> FROM `table_stats/alltypes_with_nulls`
> )
> GROUP BY bucket_num
> ORDER BY bucket_num;
> +------------+----------------------+----------------------+-------------+
> | bucket_num | min_amount | max_amount | total_count |
> +------------+----------------------+----------------------+-------------+
> | 1 | -8804872880253829120 | -6983033704176156672 | 10 |
> | 2 | -6772904422084182016 | -5326061597989273600 | 10 |
> | 3 | -5111449881868763136 | -2561061038367703040 | 10 |
> | 4 | -2424523650070740992 | -449093763428515840 | 10 |
> | 5 | 0 | 0 | 10 |
> | 6 | 0 | 0 | 10 |
> | 7 | 0 | 0 | 10 |
> | 8 | 0 | 884838034226544640 | 10 |
> | 9 | 884838034226544640 | 4644147690488201216 | 10 |
> | 10 | null | null | 10 |
> +------------+----------------------+----------------------+-------------+
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)