You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2019/05/01 16:57:00 UTC

[jira] [Updated] (DRILL-7228) Histogram end points show high deviation for a sample data set

     [ https://issues.apache.org/jira/browse/DRILL-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-7228:
------------------------------
      Labels: ready-to-commit  (was: )
    Reviewer: Gautam Parai

> Histogram end points show high deviation for a sample data set
> --------------------------------------------------------------
>
>                 Key: DRILL-7228
>                 URL: https://issues.apache.org/jira/browse/DRILL-7228
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.16.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> There are couple of scenarios where the histogram bucket end points show high deviation for the attached sample data set. 
> +Scenario 1: +
> There are total 100 rows in the sample. Here are the first 10 values of the c_float column ordered by the column.  
> {noformat}
> select c_float from `table_stats/alltypes_with_nulls` order by c_float;
> +--------------+
> |   c_float    |
> +--------------+
> | -4.6873795E9 |
> | 8.1855632E7  |
> | 2.65311632E8 |
> | 4.50677952E8 |
> | 4.6864464E8  |
> | 5.7848493E8  |
> | 6.6793114E8  |
> | 7.1175571E8  |
> | 9.0065581E8  |
> | 9.2245773E8  |
> ...
> ...
> <100 rows>
> {noformat}
> Here the minimum value is a small negative number.  Here's the output of the histogram after running ANALYZE command: 
> {noformat}
>  "buckets" : [ 8.1855488E7, 9.13736816E8, 1.7208630111999998E9, 3.2401755232E9, 4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 8.691596143999998E9, 9.983783792E9 ]
> {noformat}
> Note that the starting end point of bucket 0 is actually the 2nd value in the ordered list and the small negative number is not represented in the histogram at all. 
> +Scenario 2:+
> Histogram for the c_bigint column is as below: 
> {noformat}
>  {
>       "column" : "`c_bigint`",
>       "majortype" : {
>         "type" : "BIGINT",
>         "mode" : "OPTIONAL"
>       },
>       "schema" : 1.0,
>       "rowcount" : 100.0,
>       "nonnullrowcount" : 87.0,
>       "ndv" : 46,
>       "avgwidth" : 8.0,
>       "histogram" : {
>         "category" : "numeric-equi-depth",
>         "numRowsPerBucket" : 8,
>         "buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, -5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 8.848383132345303E17, 4.6441480083157811E18 ]
>       }
>     }
> {noformat}
> This indicates that there are duplicate rows with the value close to 2.83 which is not true when we analyze the source data.
> This is the output of the ntile function:
> {noformat}
> SELECT bucket_num,
>  min(c_bigint) as min_amount,
> max(c_bigint) as max_amount,
>  count(*) as total_count
>    FROM (
>      SELECT c_bigint,
>          NTILE(10) OVER (ORDER BY c_bigint) as bucket_num
>          FROM `table_stats/alltypes_with_nulls`
>   )
> GROUP BY bucket_num
>  ORDER BY bucket_num;
> +------------+----------------------+----------------------+-------------+
> | bucket_num |      min_amount      |      max_amount      | total_count |
> +------------+----------------------+----------------------+-------------+
> | 1          | -8804872880253829120 | -6983033704176156672 | 10          |
> | 2          | -6772904422084182016 | -5326061597989273600 | 10          |
> | 3          | -5111449881868763136 | -2561061038367703040 | 10          |
> | 4          | -2424523650070740992 | -449093763428515840  | 10          |
> | 5          | 0                    | 0                    | 10          |
> | 6          | 0                    | 0                    | 10          |
> | 7          | 0                    | 0                    | 10          |
> | 8          | 0                    | 884838034226544640   | 10          |
> | 9          | 884838034226544640   | 4644147690488201216  | 10          |
> | 10         | null                 | null                 | 10          |
> +------------+----------------------+----------------------+-------------+
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)