You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2019/06/12 18:14:00 UTC

[jira] [Commented] (IMPALA-8657) Aggr functions return nonreasonable NaN in Analytic query

    [ https://issues.apache.org/jira/browse/IMPALA-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862334#comment-16862334 ] 

Tim Armstrong commented on IMPALA-8657:
---------------------------------------

I know exactly what's happening here. 

Unfortunately I think it's a fairly fundamental issue with the way analytic functions are implemented and the properties of floating point numbers. Maybe there's a solution for a bunch of special cases, but not for all the classes of problems.

A lot of analytic functions for sliding windows are implemented with Update()/Remove() - a single intermediate state is tracked and values are added and removed as the window removes. This achieve the same effect as evaluating the function over the values in the window iff the Remove() operation can precisely undo the Update() operation (that's not a precise mathematical definition, but you get the idea).  Floating point operations do not have the required properties, e.g. sometimes (a - b + b)  != a. E.g.:

* If b == NaN or Inf/-Inf, it's not reversible
* If there is some loss of precision, as part of the addition, then you might not get the exact result back.

We could probably work around the NaN and Inf/-Inf problems by tracking the number of NaN/inf values in the intermediate state instead of just naively adding or subtracting the NaN/info. The trickier issues are a bit less glaring but trickier.

One way to solve this is to use fixed-precision DECIMAL. That avoids all these headaches.

> Aggr functions return nonreasonable NaN in Analytic query 
> ----------------------------------------------------------
>
>                 Key: IMPALA-8657
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8657
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.8.0, Impala 2.7.1, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0, Impala 2.12.0, Impala 3.1.0, Impala 3.2.0
>            Reporter: Xiaomin Zhang
>            Priority: Major
>              Labels: correctness
>         Attachments: NaN_diff.txt, data.csv, test.sql
>
>
> When dataset (attached as data.csv) has +/-Inf or NaN rows, Impala could return nonreasonable NaN in analytic queries like below:
> SELECT 
>  customerid
>  , num_doc
>  , fecha_eeff
>  , patrimonio
>  , FIRST_VALUE(patrimonio) OVER(PARTITION BY customerid, num_doc ORDER BY fecha_eeff ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS primer_valor
>  , LAST_VALUE(patrimonio) OVER(PARTITION BY customerid, num_doc ORDER BY fecha_eeff ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS segundo_valor
>  , AVG(patrimonio) OVER(PARTITION BY customerid, num_doc ORDER BY fecha_eeff ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS patrimonio_avg
> FROM base_case_avg
> ORDER BY fecha_eeff
> +-------------+-------------+------------+--------------------+--------------------+--------------------+---------------------+
> | customerid | num_doc | fecha_eeff | patrimonio | primer_valor | segundo_valor | patrimonio_avg |
> +-------------+-------------+------------+--------------------+--------------------+--------------------+---------------------+
> | -2147483648 | -2147483648 | 20081231 | NULL | NULL | NULL | NULL |
> | -2147483648 | -2147483648 | 20091231 | NULL | NULL | NULL | NULL |
> | -2147483648 | -2147483648 | 20101231 | -26.71346658 | NULL | -26.71346658 | -26.71346658 |
> | -2147483648 | -2147483648 | 20111231 | 25.20101545 | -26.71346658 | 25.20101545 | -0.7562255649999994 |
> | -2147483648 | -2147483648 | 20121231 | Infinity | 25.20101545 | Infinity | Infinity |
> | -2147483648 | -2147483648 | 20131231 | NaN | Infinity | NaN | NaN |
> | -2147483648 | -2147483648 | 20141231 | -Infinity | NaN | -Infinity | NaN |
> | -2147483648 | -2147483648 | 20151231 | 640.7372572 | -Infinity | 640.7372572 | NaN |
> | -2147483648 | -2147483648 | 20160630 | -725.6532577 | 640.7372572 | -725.6532577 | NaN |
> | -2147483648 | -2147483648 | 20160930 | -618.0659668 | -725.6532577 | -618.0659668 | NaN |
> | -2147483648 | -2147483648 | 20161231 | -736.5709632000001 | -618.0659668 | -736.5709632000001 | NaN |
> | -2147483648 | -2147483648 | 20170630 | 89.46417653 | -736.5709632000001 | 89.46417653 | NaN |
> | -2147483648 | -2147483648 | 20171231 | Infinity | 89.46417653 | Infinity | NaN |
> | -2147483648 | -2147483648 | 20180630 | NaN | Infinity | NaN | NaN |
> | -2147483648 | -2147483648 | 20180930 | -Infinity | NaN | -Infinity | NaN |
> | -2147483648 | -2147483648 | 20181231 | NaN | -Infinity | NaN | NaN |
> +-------------+-------------+------------+--------------------+--------------------+--------------------+---------------------+
> AVG returns NaN or Infinity correctly for the first 8 rows, however for the rest, Impala always returns NaN even for the valid numbers:
> |-2147483648|-2147483648|20160630|-725.6532577|640.7372572|-725.6532577|NaN|
>  
> I understand NaN (and +Inf/-Inf) handling is not a SQL standard, hence resulting quite different implementations. As a comparison, I tested the same query with other RDBMS and Hive, and the results are more reasonable (attached as NaN_diff.txt) though they are also quite different:
> 1) MySQL (v8.0.16, with WINDOW support) treats Inf/NaN as 0
> 2) PostgreSQL follows the IEEE-754, and produced result confirming to C library [1]
> 3) Hive does not distinguish between Infinity and NaNs. But unlike MySQL, Hive treats them as NULL, hence not counted in AVG and other aggregation functions.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org