You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/04/30 21:57:00 UTC
[jira] [Commented] (DRILL-7228) Histogram end points show high deviation for a sample data set

    [ https://issues.apache.org/jira/browse/DRILL-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830725#comment-16830725 ] 

ASF GitHub Bot commented on DRILL-7228:
---------------------------------------

amansinha100 commented on pull request #1774: DRILL-7228: Upgrade to a newer version of t-digest to address inaccur…
URL: https://github.com/apache/drill/pull/1774
 
 
   …acies in histogram buckets.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Histogram end points show high deviation for a sample data set
> --------------------------------------------------------------
>
>                 Key: DRILL-7228
>                 URL: https://issues.apache.org/jira/browse/DRILL-7228
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.16.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>             Fix For: 1.17.0
>
>
> There are couple of scenarios where the histogram bucket end points show high deviation for the attached sample data set. 
> +Scenario 1: +
> There are total 100 rows in the sample. Here are the first 10 values of the c_float column ordered by the column.  
> {noformat}
> select c_float from `table_stats/alltypes_with_nulls` order by c_float;
> +--------------+
> |   c_float    |
> +--------------+
> | -4.6873795E9 |
> | 8.1855632E7  |
> | 2.65311632E8 |
> | 4.50677952E8 |
> | 4.6864464E8  |
> | 5.7848493E8  |
> | 6.6793114E8  |
> | 7.1175571E8  |
> | 9.0065581E8  |
> | 9.2245773E8  |
> ...
> ...
> <100 rows>
> {noformat}
> Here the minimum value is a small negative number.  Here's the output of the histogram after running ANALYZE command: 
> {noformat}
>  "buckets" : [ 8.1855488E7, 9.13736816E8, 1.7208630111999998E9, 3.2401755232E9, 4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 8.691596143999998E9, 9.983783792E9 ]
> {noformat}
> Note that the starting end point of bucket 0 is actually the 2nd value in the ordered list and the small negative number is not represented in the histogram at all. 
> +Scenario 2:+
> Histogram for the c_bigint column is as below: 
> {noformat}
>  {
>       "column" : "`c_bigint`",
>       "majortype" : {
>         "type" : "BIGINT",
>         "mode" : "OPTIONAL"
>       },
>       "schema" : 1.0,
>       "rowcount" : 100.0,
>       "nonnullrowcount" : 87.0,
>       "ndv" : 46,
>       "avgwidth" : 8.0,
>       "histogram" : {
>         "category" : "numeric-equi-depth",
>         "numRowsPerBucket" : 8,
>         "buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, -5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 8.848383132345303E17, 4.6441480083157811E18 ]
>       }
>     }
> {noformat}
> This indicates that there are duplicate rows with the value close to 2.83 which is not true when we analyze the source data.
> This is the output of the ntile function:
> {noformat}
> SELECT bucket_num,
>  min(c_bigint) as min_amount,
> max(c_bigint) as max_amount,
>  count(*) as total_count
>    FROM (
>      SELECT c_bigint,
>          NTILE(10) OVER (ORDER BY c_bigint) as bucket_num
>          FROM `table_stats/alltypes_with_nulls`
>   )
> GROUP BY bucket_num
>  ORDER BY bucket_num;
> +------------+----------------------+----------------------+-------------+
> | bucket_num |      min_amount      |      max_amount      | total_count |
> +------------+----------------------+----------------------+-------------+
> | 1          | -8804872880253829120 | -6983033704176156672 | 10          |
> | 2          | -6772904422084182016 | -5326061597989273600 | 10          |
> | 3          | -5111449881868763136 | -2561061038367703040 | 10          |
> | 4          | -2424523650070740992 | -449093763428515840  | 10          |
> | 5          | 0                    | 0                    | 10          |
> | 6          | 0                    | 0                    | 10          |
> | 7          | 0                    | 0                    | 10          |
> | 8          | 0                    | 884838034226544640   | 10          |
> | 9          | 884838034226544640   | 4644147690488201216  | 10          |
> | 10         | null                 | null                 | 10          |
> +------------+----------------------+----------------------+-------------+
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)