
Posted to issues@impala.apache.org by "Zoltán Borók-Nagy (JIRA)" <ji...@apache.org> on 2018/02/19 13:09:00 UTC

[jira] [Created] (IMPALA-6538) Fix read path when Parquet min(_value)/max(_value) statistics contain NaN

Zoltán Borók-Nagy created IMPALA-6538:
-----------------------------------------

             Summary: Fix read path when Parquet min(_value)/max(_value) statistics contain NaN
                 Key: IMPALA-6538
                 URL: https://issues.apache.org/jira/browse/IMPALA-6538
             Project: IMPALA
          Issue Type: Sub-task
            Reporter: Zoltán Borók-Nagy


(Below I write only min and max, but the same applies to min_value and max_value.)

When both min and max are NaN:
 * Written by Impala:
 ** the first element in the row group is NaN, but not all elements are (Impala writer bug)
 ** all elements are NaN
 * Written by Hive/Parquet-mr:
 ** all elements are NaN

Either min or max is NaN, but not both:
 * Written by Impala:
 ** this cannot happen currently
 * Written by Hive/Parquet-mr:
 ** only the max can be NaN (needs to be checked)

Therefore, if both min and max are NaN, we can't use the statistics for filtering.

If only the max is NaN, we still have a valid lower bound.

 

A workaround could be to replace the NaNs with infinities, i.e. max => Inf, min => -Inf.
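
A minimal sketch of that workaround, in Python rather than Impala's actual reader code. The function names (sanitize_bounds, may_contain_less_than, may_contain_greater_than) are hypothetical and only illustrate the idea: a NaN bound is replaced by the infinity that excludes nothing, so a row group with NaN on both sides is never skipped, while a valid min with a NaN max still works as a lower bound.

```python
import math


def sanitize_bounds(min_stat, max_stat):
    """Replace NaN bounds with infinities so they no longer exclude anything.

    Hypothetical helper: min => -Inf, max => +Inf when the stat is NaN.
    """
    lo = -math.inf if math.isnan(min_stat) else min_stat
    hi = math.inf if math.isnan(max_stat) else max_stat
    return lo, hi


def may_contain_less_than(min_stat, max_stat, c):
    """Return True if the row group may hold a row satisfying x < c."""
    lo, _ = sanitize_bounds(min_stat, max_stat)
    return lo < c


def may_contain_greater_than(min_stat, max_stat, c):
    """Return True if the row group may hold a row satisfying x > c."""
    _, hi = sanitize_bounds(min_stat, max_stat)
    return hi > c
```

With min = 5.0 and max = NaN, a predicate such as x < 3 can still skip the row group through the lower bound, whereas x > 100 cannot skip it because the upper bound has become +Inf.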

Based on my experiments, min/max statistics are not applied to predicates that can be true for NaN, e.g. 'NOT x < 3'.
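
To illustrate why such predicates matter, here is a small Python sketch using IEEE 754 comparison semantics. The row group contents and the assumption that the bounds are computed over the non-NaN values only are hypothetical, not taken from any writer's actual behaviour.

```python
import math

rows = [1.0, 2.0, math.nan]  # hypothetical row group

# Bounds over the non-NaN values only (an assumption for this sketch).
non_nan = [x for x in rows if not math.isnan(x)]
lo, hi = min(non_nan), max(non_nan)

# Any ordered comparison with NaN is false, so NOT (x < 3) is true for NaN.
matches = [x for x in rows if not (x < 3)]

# Naive pruning that rewrites NOT (x < 3) as x >= 3 would skip this
# row group because hi < 3, even though the NaN row matches.
naive_skip = hi < 3
```

Here the only matching row is the NaN, and bounds-based pruning would have dropped it, which is consistent with statistics not being applied to this kind of predicate.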


