Posted to issues@impala.apache.org by "Thomas Tauber-Marshall (JIRA)" <ji...@apache.org> on 2018/01/04 18:02:00 UTC

[jira] [Resolved] (IMPALA-6295) Inconsistent handling of 'nan' and 'inf' with min/max analytic fns

     [ https://issues.apache.org/jira/browse/IMPALA-6295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Tauber-Marshall resolved IMPALA-6295.
--------------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.12.0

commit 96b976aff38de29619ca97dabc47566382e90bf8
Author: Thomas Tauber-Marshall <tm...@cloudera.com>
Date:   Thu Dec 14 16:40:55 2017 -0800

    IMPALA-6295: Fix min/max handling of 'nan' and 'inf'
    
    This patch fixes several issues related to the min/max aggregate
    functions and their handling of 'nan' and 'inf':
    - Previously, if 'inf' or '-inf' was the only value for the min/max
      and codegen was being used, the result would be incorrect. This
      occurred, for example in the case of 'inf' and 'min', because we
      set an initial value of numeric_limits::max, which is less than
      'inf', so the returned min was numeric_limits::max when it should be
      'inf'. The fix is to set the initial value to
      numeric_limits::infinity.
    - Previously, if one of the values was 'nan', the result of min/max
      was non-deterministic, depending on the order in which the values
      were evaluated. This occurs because any '<' or '>' comparison
      involving 'nan' is always false, so if the first value added was
      'nan', all other comparisons would be false and 'nan' would be
      returned, whereas if the first value wasn't 'nan' then the 'nan'
      wouldn't be returned. The fix is to treat 'nan' specially and to
      always return 'nan' if at least one of the values is 'nan'.
    
    Testing:
    - Added e2e tests for both scenarios, as well as adding a little extra
      nan/inf coverage for other aggregate functions.
    
    Change-Id: Ia1e206105937ce5afc75ca5044597d39b3dc6a81
    Reviewed-on: http://gerrit.cloudera.org:8080/8854
    Reviewed-by: Bikramjeet Vig <bi...@cloudera.com>
    Reviewed-by: Tim Armstrong <ta...@cloudera.com>
    Tested-by: Impala Public Jenkins
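
The two fixes described in the commit message can be sketched in plain C++. This is a minimal illustration, not the Impala implementation: the names MinWithNanAndInf, MinSeededWithMax, kInf and kNan are hypothetical, and the "seeded with max()" variant is only an assumed model of the old codegen behavior, chosen because it matches the values the issue reports.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical constants used by the sketch and its checks.
const double kInf = std::numeric_limits<double>::infinity();
const double kNan = std::numeric_limits<double>::quiet_NaN();

// Hypothetical helper (not Impala code): min over doubles with the semantics
// described in the commit message. An empty input returns +infinity here;
// a SQL engine would return NULL instead.
double MinWithNanAndInf(const std::vector<double>& values) {
  // Seed with +infinity rather than numeric_limits<double>::max(), so that a
  // group containing only 'inf' still yields 'inf'.
  double result = kInf;
  for (double v : values) {
    // Any NaN decides the result, regardless of where it appears.
    if (std::isnan(v)) return v;
    if (v < result) result = v;
  }
  return result;
}

// Assumed model of the old behavior: seeded with max(), no NaN handling.
double MinSeededWithMax(const std::vector<double>& values) {
  double result = std::numeric_limits<double>::max();
  for (double v : values) {
    // 'inf < max()' and 'nan < max()' are both false, so neither replaces
    // the seed.
    if (v < result) result = v;
  }
  return result;
}
```

Under these assumptions, MinSeededWithMax({kInf}) returns numeric_limits<double>::max(), while MinWithNanAndInf({kInf}) returns 'inf' and MinWithNanAndInf returns 'nan' for both {kNan, 0.0} and {0.0, kNan}.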

> Inconsistent handling of 'nan' and 'inf' with min/max analytic fns
> ------------------------------------------------------------------
>
>                 Key: IMPALA-6295
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6295
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.11.0
>            Reporter: Thomas Tauber-Marshall
>            Assignee: Thomas Tauber-Marshall
>            Priority: Critical
>              Labels: codegen, correctness
>             Fix For: Impala 2.12.0
>
>
> Incorrect results are returned in some cases where 'nan'/'inf' are the only values in the group and codegen is enabled:
> {noformat}
> > set DISABLE_CODEGEN_ROWS_THRESHOLD set to 0
> > select * from test1 order by col1
> +------+-----------+
> | col0 | col1      |
> +------+-----------+
> | 0    | NaN       |
> | 2    | -Infinity |
> | 3    | 0         |
> | 1    | Infinity  |
> +------+-----------+
> > set DISABLE_CODEGEN set to true
> > select col0, min(col1) from test1 group by col0 order by col0
> +------+-----------+
> | col0 | min(col1) |
> +------+-----------+
> | 0    | NaN       |
> | 1    | Infinity  |
> | 2    | -Infinity |
> | 3    | 0         |
> +------+-----------+
> > set DISABLE_CODEGEN set to false
> > select col0, min(col1) from test1 group by col0 order by col0
> +------+------------------------+
> | col0 | min(col1)              |
> +------+------------------------+
> | 0    | 1.797693134862316e+308 |
> | 1    | 1.797693134862316e+308 |
> | 2    | -Infinity              |
> | 3    | 0                      |
> +------+------------------------+
> > set DISABLE_CODEGEN set to true
> > select col0, max(col1) from test1 group by col0 order by col0
> +------+-----------+
> | col0 | max(col1) |
> +------+-----------+
> | 0    | NaN       |
> | 1    | Infinity  |
> | 2    | -Infinity |
> | 3    | 0         |
> +------+-----------+
> > set DISABLE_CODEGEN set to false
> > select col0, max(col1) from test1 group by col0 order by col0
> +------+-------------------------+
> | col0 | max(col1)               |
> +------+-------------------------+
> | 0    | -1.797693134862316e+308 |
> | 1    | Infinity                |
> | 2    | -1.797693134862316e+308 |
> | 3    | 0                       |
> +------+-------------------------+
> {noformat}
> We also appear to never return 'nan' as a min or max value despite sorting it as the lowest value when ordering a table (perhaps this is the intended behavior?):
> {noformat}
> > set DISABLE_CODEGEN_ROWS_THRESHOLD set to 0
> > select * from test2 order by col1
> +------+-----------+
> | col0 | col1      |
> +------+-----------+
> | 0    | NaN       |
> | 2    | -Infinity |
> | 0    | 0         |
> | 3    | 0         |
> | 1    | 1         |
> | 2    | 2         |
> | 3    | 3         |
> | 1    | Infinity  |
> +------+-----------+
> > set DISABLE_CODEGEN set to true
> > select col0, min(col1) from test2 group by col0 order by col0
> +------+-----------+
> | col0 | min(col1) |
> +------+-----------+
> | 0    | 0         |
> | 1    | 1         |
> | 2    | -Infinity |
> | 3    | 0         |
> +------+-----------+
> > set DISABLE_CODEGEN set to false
> > select col0, min(col1) from test2 group by col0 order by col0
> +------+-----------+
> | col0 | min(col1) |
> +------+-----------+
> | 0    | 0         |
> | 1    | 1         |
> | 2    | -Infinity |
> | 3    | 0         |
> +------+-----------+
> > set DISABLE_CODEGEN set to true
> > select col0, max(col1) from test2 group by col0 order by col0
> +------+-----------+
> | col0 | max(col1) |
> +------+-----------+
> | 0    | 0         |
> | 1    | Infinity  |
> | 2    | 2         |
> | 3    | 3         |
> +------+-----------+
> > set DISABLE_CODEGEN set to false
> > select col0, max(col1) from test2 group by col0 order by col0
> +------+-----------+
> | col0 | max(col1) |
> +------+-----------+
> | 0    | 0         |
> | 1    | Infinity  |
> | 2    | 2         |
> | 3    | 3         |
> +------+-----------+
> {noformat}
> Changing LlvmCodeGen::CodegenMinMax to use OLT/OGT float comparison functions appears to solve the first case (at least for 'nan'), but leads to us returning 'nan' as a max value in the second case.
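
For readers unfamiliar with the OLT/OGT predicates mentioned above: in LLVM, an ordered comparison (OLT, OGT) is false whenever either operand is 'nan', while the unordered counterpart (ULT, UGT) is true in that case. The sketch below models both in plain C++. The function names are hypothetical, and the shape of UpdateMin is an assumption made for illustration; the issue does not show the body of LlvmCodeGen::CodegenMinMax or say which predicates it used before.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Model of LLVM's ordered less-than (OLT): true only if neither operand is
// NaN and a < b. The built-in '<' on doubles already behaves this way.
bool OrderedLessThan(double a, double b) { return a < b; }

// Model of LLVM's unordered less-than (ULT): true if either operand is NaN,
// or if a < b.
bool UnorderedLessThan(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a < b;
}

// Assumed shape of a min update step: keep 'current' if it compares less
// than 'incoming', otherwise take 'incoming'.
template <typename Less>
double UpdateMin(double current, double incoming, Less less) {
  return less(current, incoming) ? current : incoming;
}
```

With this assumed update step, an unordered comparison never lets an incoming 'nan' replace the current value, whereas an ordered comparison lets 'nan' in but also lets the next non-'nan' value replace it, so the result depends on evaluation order.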


