Posted to issues@impala.apache.org by "Zoltán Borók-Nagy (JIRA)" <ji...@apache.org> on 2018/03/08 17:29:00 UTC

[jira] [Resolved] (IMPALA-6527) NaN values lead to incorrect filtering under certain circumstances

     [ https://issues.apache.org/jira/browse/IMPALA-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltán Borók-Nagy resolved IMPALA-6527.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.12.0
                   Impala 3.0

Write path is also fixed: https://github.com/apache/impala/commit/5d044e0cb201d975bc2af478501bd968407e1962

> NaN values lead to incorrect filtering under certain circumstances
> ------------------------------------------------------------------
>
>                 Key: IMPALA-6527
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6527
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.11.0
>            Reporter: Zoltan Ivanfi
>            Assignee: Zoltán Borók-Nagy
>            Priority: Blocker
>              Labels: correctness, parquet
>             Fix For: Impala 3.0, Impala 2.12.0
>
>
> h1. Summary
> If the first number in a row group written by Impala is NaN, then Impala writes incorrect statistics in the metadata. Queries that filter on that column can then return incorrect results.
> h1. Reproduction
> First, create a Parquet table with a double column:
> {noformat}
> create table test_nan(val double) stored as parquet;
> {noformat}
> Insert two values in a single statement, the first of which is a NaN:
> {noformat}
> insert into test_nan values (cast('NaN' as double)), (42);
> {noformat}
> Check that both values are actually present in the table:
> {noformat}
> select * from test_nan;
> +-----+
> | val |
> +-----+
> | NaN |
> | 42  |
> +-----+
> Fetched 2 row(s) in 0.13s
> {noformat}
> Filter using a condition that should match the regular number:
> {noformat}
> select * from test_nan where val > 0;
> Fetched 0 row(s) in 0.13s
> {noformat}
> *Expectation*: The row with the regular number should be returned.
> *Actual result*: No rows are returned.
> h1. Explanation
> Parquet files contain statistics metadata including the fields {{min}} and {{max}} or {{min_value}} and {{max_value}} (depending on the Impala version). If the first number is a NaN, the minimum and maximum values that Impala writes in the metadata are both NaN. Judging by this metadata, the row group cannot contain any value that matches the condition, so Impala discards its contents without checking the individual entries. The root cause is that the statistics were written incorrectly in the first place. (This has been verified by running {{parquet-tools meta}} on the Parquet file.)
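To illustrate why NaN statistics make the reader drop the whole row group, here is a minimal Python sketch of min/max-based skipping for a predicate of the form val > bound. The function name can_skip_row_group is hypothetical and this is not Impala's actual code; it only assumes that the reader skips a row group when even the recorded maximum fails the predicate.

```python
def can_skip_row_group(stats_min, stats_max, bound):
    """Hypothetical check for the predicate `val > bound`.

    The row group is skipped when even its recorded maximum does not
    satisfy the predicate. stats_min is unused for a `>` predicate and
    is kept only to mirror the min/max pair stored in the metadata.
    """
    return not (stats_max > bound)


nan = float("nan")

# Correct statistics for the row group {NaN, 42}: 42 > 0, so it must be read.
print(can_skip_row_group(42.0, 42.0, 0.0))  # False

# Statistics as described in this report: NaN > 0 is False, so the
# row group is skipped even though it contains 42.
print(can_skip_row_group(nan, nan, 0.0))  # True
```

With NaN in the statistics, the check cannot distinguish "no value can match" from "the statistics are unusable", which matches the empty result seen in the reproduction.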
> What follows are just my assumptions without checking the actual code: While writing data, Impala keeps track of the smallest and largest value encountered so far. Let's call them min_so_far and max_so_far, respectively.
> Initially, the first (non-NULL) value is set as both the min_so_far and max_so_far. Then each new value is compared against min_so_far and max_so_far, updating each one if necessary. In pseudocode:
> {code:java}
> if (new_value < min_so_far) {
>   min_so_far = new_value;
> }
> {code}
> The problem is that any ordering comparison (<, >, <=, >=) involving NaN returns false, so if NaN is already in min_so_far, no value can ever replace it and NaN stays stuck there. The same holds for max_so_far.
> On the positive side, min_so_far can only become NaN if the first value in the row group is NaN. If the first value is not NaN, then NaN can never replace min_so_far, since the comparison will always return false when it involves a NaN.
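The tracking behavior described in the report can be reproduced with a short Python sketch. Both function names are hypothetical. naive_min_max follows the reporter's assumed algorithm, and nan_aware_min_max shows one possible remedy (ignoring NaN while tracking), which is an assumption for illustration and not necessarily what the linked commit does.

```python
import math


def naive_min_max(values):
    """Seed min/max with the first value, then update by plain comparison."""
    it = iter(values)
    min_so_far = max_so_far = next(it)
    for v in it:
        if v < min_so_far:  # always False once min_so_far is NaN
            min_so_far = v
        if v > max_so_far:  # always False once max_so_far is NaN
            max_so_far = v
    return min_so_far, max_so_far


def nan_aware_min_max(values):
    """Skip NaN entirely; returns (None, None) if there is no ordinary value."""
    min_so_far = max_so_far = None
    for v in values:
        if math.isnan(v):
            continue
        if min_so_far is None or v < min_so_far:
            min_so_far = v
        if max_so_far is None or v > max_so_far:
            max_so_far = v
    return min_so_far, max_so_far


nan = float("nan")
print(naive_min_max([nan, 42.0]))       # (nan, nan): NaN came first and is stuck
print(naive_min_max([42.0, nan, 7.0]))  # (7.0, 42.0): a later NaN is harmless
print(nan_aware_min_max([nan, 42.0]))   # (42.0, 42.0)
```

The second call shows the "positive side" noted above: when NaN is not the first value, the comparisons against it are false and it never enters the statistics.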



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)