Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/12/07 14:09:00 UTC
[jira] [Commented] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
[ https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644354#comment-17644354 ]
Antoine Pitrou commented on ARROW-12264:
----------------------------------------
cc @westonpace
> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> -------------------------------------------------------------------
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet
> Reporter: Antoine Pitrou
> Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of floating-point statistics:
> {code}
> * (*) Because the sorting order is not specified properly for floating
> * point values (relations vs. total ordering) the following
> * compatibility rules should be applied when reading statistics:
> * - If the min is a NaN, it should be ignored.
> * - If the max is a NaN, it should be ignored.
> * - If the min is +0, the row group may contain -0 values as well.
> * - If the max is -0, the row group may contain +0 values as well.
> * - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter, because every ordered comparison involving NaN evaluates to false, and yet it may be found in the given Parquet column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if either value is NaN.
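
For illustration, the behaviour described above can be sketched in plain C++ without any Arrow code. Everything in this sketch is hypothetical: FloatStats, NaiveMayContain, GreaterEqualOrNan, LessEqualOrNan and NanAwareMayContain are names made up here, not Arrow or Parquet APIs, and scalar doubles stand in for the dataset expressions. It assumes the "or NaN" comparison returns true when either operand is NaN, as suggested in the description.

```cpp
#include <cmath>

// Hypothetical stand-in for the min/max statistics of one column chunk.
// This is not an Arrow or Parquet type.
struct FloatStats {
  double min;
  double max;
};

// Mirrors the shape of the current filter: min <= x && x <= max.
// Any ordered comparison with NaN is false, so a NaN x is always rejected.
bool NaiveMayContain(const FloatStats& s, double x) {
  return x >= s.min && x <= s.max;
}

// Hypothetical "greater_equal_or_nan": true if either operand is NaN.
bool GreaterEqualOrNan(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a >= b;
}

// Hypothetical "less_equal_or_nan": true if either operand is NaN.
bool LessEqualOrNan(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a <= b;
}

// A NaN min or max is ignored, and a NaN x ignores both bounds.
// IEEE comparison treats -0.0 and +0.0 as equal, so the signed-zero
// rules quoted from parquet.thrift hold without extra handling.
bool NanAwareMayContain(const FloatStats& s, double x) {
  return GreaterEqualOrNan(x, s.min) && LessEqualOrNan(x, s.max);
}
```

With statistics {1.0, 5.0}, NaiveMayContain rejects a NaN lookup while NanAwareMayContain accepts it, so the column chunk would not be skipped. The trade-off is that statistics never prune a chunk when searching for NaN, which matches the last compatibility rule quoted above.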