Posted to issues@impala.apache.org by "Zoltán Borók-Nagy (JIRA)" <ji...@apache.org> on 2018/02/22 11:59:00 UTC

[jira] [Resolved] (IMPALA-6538) Fix read path when Parquet min(_value)/max(_value) statistics contain NaN

     [ https://issues.apache.org/jira/browse/IMPALA-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltán Borók-Nagy resolved IMPALA-6538.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.12.0
                   Impala 3.0

Fixed by https://github.com/apache/impala/commit/881e00a8bff0469ab7860bcd0d4d4794fb04a4b8

> Fix read path when Parquet min(_value)/max(_value) statistics contain NaN
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-6538
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6538
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>             Fix For: Impala 3.0, Impala 2.12.0
>
>
> (I'll write only min and max, but I also mean min_value and max_value by that.)
> When both min and max are NaN:
>  * Written by Impala:
>  ** the first element in the row group is NaN, but not all elements are (Impala writer bug)
>  ** all elements are NaN
>  * Written by Hive/Parquet-mr:
>  ** all elements are NaN
> Either min or max is NaN, but not both:
>  * Written by Impala:
>  ** this cannot happen currently
>  * Written by Hive/Parquet-mr:
>  ** only the max can be NaN (needs to be checked)
> Therefore, if both min and max are NaN, we can't use the statistics for filtering.
> If only the max is NaN, we still have a valid lower bound.
>  
> A workaround can be to change the NaNs to infinities, i.e. max => Inf, min => -Inf.
> Based on my experiments, min/max statistics are not applied to predicates that can be true for NaN, e.g. 'NOT x < 3'.
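
The infinity workaround described above can be illustrated with a minimal Python sketch. This is not Impala's code (the actual fix is in the commit linked above); the function names widen_nan_stats and can_skip_less_than are hypothetical and exist only for this illustration. It assumes a simple 'x < c' predicate and float statistics.

```python
import math
from typing import Tuple


def widen_nan_stats(min_val: float, max_val: float) -> Tuple[float, float]:
    """Replace NaN statistics with infinities (hypothetical helper).

    A NaN min becomes -inf and a NaN max becomes +inf, so a NaN bound
    simply stops constraining the range instead of poisoning comparisons.
    """
    lower = -math.inf if math.isnan(min_val) else min_val
    upper = math.inf if math.isnan(max_val) else max_val
    return lower, upper


def can_skip_less_than(stats: Tuple[float, float], c: float) -> bool:
    """Return True if no value in [lower, upper] can satisfy 'x < c'.

    NaN rows never satisfy 'x < c', so they do not affect this decision.
    """
    lower, _ = stats
    return lower >= c


# Both statistics NaN: the widened range is (-inf, inf), so nothing is skipped.
both_nan = widen_nan_stats(math.nan, math.nan)

# Only max is NaN: the lower bound is still usable for filtering.
max_nan = widen_nan_stats(5.0, math.nan)

# A predicate such as 'NOT x < 3' is true for NaN, so min/max statistics
# must not be used to skip row groups for it.
not_less_than_holds_for_nan = not (math.nan < 3)
```

With both bounds widened, the "both NaN" case needs no special handling: the range excludes nothing, which matches the statement that such statistics cannot be used for filtering.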



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)