Posted to issues@impala.apache.org by "Zoltán Borók-Nagy (JIRA)" <ji...@apache.org> on 2018/02/20 16:10:00 UTC

[jira] [Created] (IMPALA-6542) Fix inconsistent write path of Parquet min/max statistics

Zoltán Borók-Nagy created IMPALA-6542:
-----------------------------------------

             Summary: Fix inconsistent write path of Parquet min/max statistics
                 Key: IMPALA-6542
                 URL: https://issues.apache.org/jira/browse/IMPALA-6542
             Project: IMPALA
          Issue Type: Sub-task
            Reporter: Zoltán Borók-Nagy


If the first value of a column chunk is NaN, then min_value = max_value = NaN.

If the first value of a column chunk is not NaN, i.e. it is an ordinary number or +/-infinity, then in the end neither min_value nor max_value is NaN, even if NaNs occur later in the chunk.
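
The two cases above can be reproduced with a minimal sketch. It assumes the statistics are seeded from the first value and then updated with ordinary < and > comparisons; the function name naive_min_max is hypothetical and this is not taken from the Impala code.

```python
import math

def naive_min_max(values):
    """Seed min/max with the first value, then update with plain comparisons.

    Assumption for illustration only: this mimics a writer that never
    special-cases NaN. Every comparison involving NaN evaluates to False.
    """
    it = iter(values)
    first = next(it)
    min_value = max_value = first
    for v in it:
        if v < min_value:   # always False if either side is NaN
            min_value = v
        if v > max_value:   # always False if either side is NaN
            max_value = v
    return min_value, max_value

# NaN first: it is never replaced, so both statistics stay NaN.
nan_first = naive_min_max([math.nan, 1.0, 2.0])

# NaN later: it never wins a comparison, so it is silently ignored.
nan_later = naive_min_max([1.0, math.nan, 2.0])
```

The same set of values therefore yields different statistics depending only on where the NaN appears in the chunk.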

 

Until the Parquet community agrees on the ordering of floating point numbers, we can at least make our write path consistent.

A quick fix is to ignore NaNs when calculating min/max statistics, except for the case when all the values are NaN.
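
A minimal sketch of this quick fix follows. The function name min_max_ignoring_nan is hypothetical, and the code only illustrates the rule described above, not the actual Impala change.

```python
import math

def min_max_ignoring_nan(values):
    """Compute min/max while skipping NaNs.

    Illustration only: the result is (NaN, NaN) when every value is NaN
    (or the input is empty), otherwise the min/max of the non-NaN values.
    """
    min_value = max_value = math.nan
    for v in values:
        if math.isnan(v):
            continue  # NaNs never contribute to the statistics
        if math.isnan(min_value) or v < min_value:
            min_value = v
        if math.isnan(max_value) or v > max_value:
            max_value = v
    return min_value, max_value

# The position of the NaN no longer matters.
stats_nan_first = min_max_ignoring_nan([math.nan, 1.0, 2.0])
stats_nan_later = min_max_ignoring_nan([1.0, math.nan, 2.0])

# Only an all-NaN chunk produces NaN statistics.
stats_all_nan = min_max_ignoring_nan([math.nan, math.nan])
```

With this rule the statistics depend only on the set of values in the chunk, not on their order.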

This way we can still use min/max statistics and the results remain correct, because only binary predicates that contain constants are tested against min/max statistics. In other words, if we want to get NaNs back with a predicate (e.g. 'NOT x < 3', 'x != x'), min/max statistics won't be used, i.e. we will get the NaNs as well.
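
To illustrate why ignoring NaNs is safe for such predicates, here is a minimal sketch. It assumes (this is not stated in the issue) that only the comparisons <, <=, >, >= and = against a constant are checked against the statistics. The function may_contain_match and its parameters are hypothetical names.

```python
import math

def may_contain_match(op, constant, min_value, max_value):
    """Return False only if the statistics prove that no value can match.

    Assumption for illustration: statistics were computed ignoring NaNs,
    and NaN statistics (all-NaN chunk) are treated conservatively.
    """
    if math.isnan(min_value) or math.isnan(max_value):
        return True  # no usable statistics, so the chunk must be read
    if op == "<":
        return min_value < constant
    if op == "<=":
        return min_value <= constant
    if op == ">":
        return max_value > constant
    if op == ">=":
        return max_value >= constant
    if op == "=":
        return min_value <= constant <= max_value
    raise ValueError(f"unsupported operator: {op}")

# A chunk with a NaN; its NaN-free statistics are min=5.0, max=7.0.
chunk = [5.0, math.nan, 7.0]

# 'x < 3' lets us skip the chunk based on the statistics ...
can_skip = not may_contain_match("<", 3.0, 5.0, 7.0)

# ... and no row is lost, because NaN < 3 is False as well.
matching_rows = [v for v in chunk if v < 3.0]
```

Under that assumption, a NaN row never satisfies any of these comparisons, so skipping a chunk on the basis of NaN-free statistics cannot drop a row that the predicate would have returned.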

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)