You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@parquet.apache.org by Zoltan Ivanfi <zi...@cloudera.com> on 2018/02/15 16:10:58 UTC

Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Dear Parquet and Impala Developers,

We have exposed min/max statistics to extensive compatibility testing and
found troubling inconsistencies regarding float and double values. Under
certain (fortunately rather extreme) circumstances, this can lead to
predicate pushdown incorrectly discarding row groups that contain matching
rows.

The root of the problem seems to be that Impala (and probably parquet-cpp
as well) uses C++ comparison operators for floating point numbers and those
do not provide a total ordering. This is actually in line with IEEE 754,
according to which -0 is neither less nor more than +0 and comparing NaN to
anything always returns false. This, however is not suitable for statistics
and can lead to serious consequences that you can read more about in
IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.

The IEEE 754 standard and the Java API, on the other hand, both provide a
total ordering, but I'm not sure whether the two are the same. The java
implementation looks relatively simple - both easy to understand and
effective to execute. You can check it here
<http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999>.
The IEEE 754 total ordering, on the other hand looks rather complicated to
the extent that I can not decide whether the Java implementation adheres to
it. I couldn't find the whole standard online, but I found an excerpt about
the totalOrder predicate here
<https://github.com/rust-lang/rust/issues/5585>. Additionally, I have also
found that IEEE 754-2008 defines min and max operations as described here
<https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
strangely *do not* adhere to a total ordering.

I checked the specification in parquet-format but all I could find about
floating point numbers is the following:

* FLOAT - signed comparison of the represented value
* DOUBLE - signed comparison of the represented value

I suggest extending the specification to explicitly require implementations
to follow a specific comparison logic for these types. The candidates are:

- The Java implementation
<http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999>
which looks easy and efficient to implement in any language.
- The IEEE 754 totalOrder <https://github.com/rust-lang/rust/issues/5585>
predicate which honestly looks rather scary.
- The IEEE 754-2008 min and max
<https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> operations
which may be hard to use for comparison.

I'm curious to hear your opinions.

Thanks,

Zoltan

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

On Fri, Feb 16, 2018 at 9:15 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> I don't see a major benefit to a temporary solution. The files are already
> out there and we need to implement a fix on the read path regardless. If we
> keep writing the stats there's at least some information contained in the
> stats that readers can make use of, if they want to implement the required
> logic.
>
> Dropping stats if an NaN is encountered also doesn't really address the
> other side of the problem - an absence of a NaN in the stats doesn't imply
> an absence of a NaN in the data, so the reader can't do anything useful
> with the stats anyway unless it's NaN-aware.
>

The writer solution is to only write stats if the data does not contain
special values (common case).

>
> On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
> wrote:
>
> > I hope the common cases is that data files do not contain these special
> > float values. As the simplest solution, how about writers refrain from
> > populating the stats if a special value is encountered?
> >
> > That fix does not preclude a more thorough solution in the future, but it
> > addresses the common case quickly.
> >
> > For existing data files we could check the writer version ignore filters
> on
> > float/double. I don't know whether min/max filtering is common on
> > float/double, but I suspect it's not.
> >
> > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
> > wrote:
> >
> > > There is an extensibility mechanism with the ColumnOrder union - I
> think
> > > that was meant to avoid the need to add new stat fields?
> > >
> > > Given that the bug was in the Parquet spec, we'll need to make a spec
> > > change anyway, so we could add a new ColumnOrder -
> > FloatingPointTotalOrder?
> > > at the same time as fixing the gap in the spec.
> > >
> > > It could make sense to declare that the default ordering for
> > floats/doubles
> > > is not NaN-aware (i.e. the reader should assume that NaN was
> arbitrarily
> > > ordered) and readers should either implement the required logic to
> handle
> > > that correctly (I had some ideas here:
> > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > or ignore the stats.
> > >
> > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com>
> wrote:
> > >
> > > > > We could have a similar problem
> > > > > with not finding +0.0 values because a -0.0 is written to the
> > max_value
> > > > > field by some component that considers them the same.
> > > >
> > > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > > under the real-number-inspired ordering, which is distinguished from
> > > > total Ordering, and which is also what you get when you use the
> > > > default C/C++ operators <, >, <=, ==, and so on.
> > > >
> > > > You can distinguish between -0.0 and +0.0 without using total
> ordering
> > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > > at the moment.
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

On Fri, Feb 16, 2018 at 9:15 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> I don't see a major benefit to a temporary solution. The files are already
> out there and we need to implement a fix on the read path regardless. If we
> keep writing the stats there's at least some information contained in the
> stats that readers can make use of, if they want to implement the required
> logic.
>
> Dropping stats if an NaN is encountered also doesn't really address the
> other side of the problem - an absence of a NaN in the stats doesn't imply
> an absence of a NaN in the data, so the reader can't do anything useful
> with the stats anyway unless it's NaN-aware.
>

The writer solution is to only write stats if the data does not contain
special values (common case).

>
> On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
> wrote:
>
> > I hope the common cases is that data files do not contain these special
> > float values. As the simplest solution, how about writers refrain from
> > populating the stats if a special value is encountered?
> >
> > That fix does not preclude a more thorough solution in the future, but it
> > addresses the common case quickly.
> >
> > For existing data files we could check the writer version ignore filters
> on
> > float/double. I don't know whether min/max filtering is common on
> > float/double, but I suspect it's not.
> >
> > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
> > wrote:
> >
> > > There is an extensibility mechanism with the ColumnOrder union - I
> think
> > > that was meant to avoid the need to add new stat fields?
> > >
> > > Given that the bug was in the Parquet spec, we'll need to make a spec
> > > change anyway, so we could add a new ColumnOrder -
> > FloatingPointTotalOrder?
> > > at the same time as fixing the gap in the spec.
> > >
> > > It could make sense to declare that the default ordering for
> > floats/doubles
> > > is not NaN-aware (i.e. the reader should assume that NaN was
> arbitrarily
> > > ordered) and readers should either implement the required logic to
> handle
> > > that correctly (I had some ideas here:
> > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > or ignore the stats.
> > >
> > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com>
> wrote:
> > >
> > > > > We could have a similar problem
> > > > > with not finding +0.0 values because a -0.0 is written to the
> > max_value
> > > > > field by some component that considers them the same.
> > > >
> > > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > > under the real-number-inspired ordering, which is distinguished from
> > > > total Ordering, and which is also what you get when you use the
> > > > default C/C++ operators <, >, <=, ==, and so on.
> > > >
> > > > You can distinguish between -0.0 and +0.0 without using total
> ordering
> > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > > at the moment.
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

I don't see a major benefit to a temporary solution. The files are already
out there and we need to implement a fix on the read path regardless. If we
keep writing the stats there's at least some information contained in the
stats that readers can make use of, if they want to implement the required
logic.

Dropping stats if an NaN is encountered also doesn't really address the
other side of the problem - an absence of a NaN in the stats doesn't imply
an absence of a NaN in the data, so the reader can't do anything useful
with the stats anyway unless it's NaN-aware.

On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
wrote:

> I hope the common cases is that data files do not contain these special
> float values. As the simplest solution, how about writers refrain from
> populating the stats if a special value is encountered?
>
> That fix does not preclude a more thorough solution in the future, but it
> addresses the common case quickly.
>
> For existing data files we could check the writer version ignore filters on
> float/double. I don't know whether min/max filtering is common on
> float/double, but I suspect it's not.
>
> On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
> wrote:
>
> > There is an extensibility mechanism with the ColumnOrder union - I think
> > that was meant to avoid the need to add new stat fields?
> >
> > Given that the bug was in the Parquet spec, we'll need to make a spec
> > change anyway, so we could add a new ColumnOrder -
> FloatingPointTotalOrder?
> > at the same time as fixing the gap in the spec.
> >
> > It could make sense to declare that the default ordering for
> floats/doubles
> > is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> > ordered) and readers should either implement the required logic to handle
> > that correctly (I had some ideas here:
> > https://issues.apache.org/jira/browse/IMPALA-6527?
> > focusedCommentId=16366106&page=com.atlassian.jira.
> > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > or ignore the stats.
> >
> > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com> wrote:
> >
> > > > We could have a similar problem
> > > > with not finding +0.0 values because a -0.0 is written to the
> max_value
> > > > field by some component that considers them the same.
> > >
> > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > under the real-number-inspired ordering, which is distinguished from
> > > total Ordering, and which is also what you get when you use the
> > > default C/C++ operators <, >, <=, ==, and so on.
> > >
> > > You can distinguish between -0.0 and +0.0 without using total ordering
> > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > at the moment.
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

On Fri, Feb 16, 2018 at 9:38 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that they are reading a file written by
> a writer that drops stats when it sees NaNs.
>
> It *could* fix the behaviour of some naive readers that don't correctly
> handle the current ambiguity in the specification, but I think those need
> to be fixed anyway because they will return wrong results for existing
> files.
>
> In the process of fixing the readers, you could then modify the readers so
> that they are aware of this special writer that drops stats with NaNs and
> knows that it is safe to use them, but I think those kind of shared
> reader-writer assumptions are essentially like having an unofficial
> extension of the Parquet spec.
>

Good point. I agree that summarizes the issues.

It is basically treating this issue like a bug in the writer (arguable). I
don't think we should require a rev in the Parquet spec to address bugs in
writers.
Working around bugs will always require checking the writer version in all
readers.

I'm certainly on board with the cleaner solution to adjust the spec - but
writers will always have bugs.

>
> On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <lv...@cloudera.com> wrote:
>
> > Yeah, I missed that. We set it per column, so all other types could keep
> > TypeDefinedOrder and floats could have something like
> NanAwareDoubleOrder.
> >
> > On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <ta...@cloudera.com>
> > wrote:
> >
> > > We wouldn't need to rev the whole TypeDefinedOrder thing right?
> Couldn't
> > we
> > > just define a special order for floats? Essentially it would be a tag
> for
> > > writers to say "hey I know about this total order thing".
> > >
> > > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:
> > >
> > > > I think one idea behind the column order fields was that if a reader
> > does
> > > > not recognize a value there, it needs to ignore the stats. If I
> > remember
> > > > correctly, that was intended to allow us to add new orderings for
> > > > collations, but it also seems useful to address gaps in the spec or
> > known
> > > > broken readers. In this case we would need to deprecate the default
> > > > "TypeDefinedOrder" and replace it with something like
> > > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> > up,
> > > > like TypeDefinedOrderV2 and so on.
> > > >
> > > > An alternative would be to list all writers that are known to have
> > > written
> > > > incorrect stats. However that will not prevent old implementations to
> > > > misinterpret correct stats - which I think was the main reason why we
> > > added
> > > > new stats fields.
> > > >
> > > >
> > > >
> > > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> > alex.behm@cloudera.com>
> > > > wrote:
> > > >
> > > > > I hope the common cases is that data files do not contain these
> > special
> > > > > float values. As the simplest solution, how about writers refrain
> > from
> > > > > populating the stats if a special value is encountered?
> > > > >
> > > > > That fix does not preclude a more thorough solution in the future,
> > but
> > > it
> > > > > addresses the common case quickly.
> > > > >
> > > > > For existing data files we could check the writer version ignore
> > > filters
> > > > on
> > > > > float/double. I don't know whether min/max filtering is common on
> > > > > float/double, but I suspect it's not.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > > tarmstrong@cloudera.com>
> > > > > wrote:
> > > > >
> > > > > > There is an extensibility mechanism with the ColumnOrder union -
> I
> > > > think
> > > > > > that was meant to avoid the need to add new stat fields?
> > > > > >
> > > > > > Given that the bug was in the Parquet spec, we'll need to make a
> > spec
> > > > > > change anyway, so we could add a new ColumnOrder -
> > > > > FloatingPointTotalOrder?
> > > > > > at the same time as fixing the gap in the spec.
> > > > > >
> > > > > > It could make sense to declare that the default ordering for
> > > > > floats/doubles
> > > > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > > > arbitrarily
> > > > > > ordered) and readers should either implement the required logic
> to
> > > > handle
> > > > > > that correctly (I had some ideas here:
> > > > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-
> 16366106)
> > > > > > or ignore the stats.
> > > > > >
> > > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbapple@cloudera.com
> >
> > > > wrote:
> > > > > >
> > > > > > > > We could have a similar problem
> > > > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > > > max_value
> > > > > > > > field by some component that considers them the same.
> > > > > > >
> > > > > > > My hope is that the filtering would behave sanely, since -0.0
> ==
> > > +0.0
> > > > > > > under the real-number-inspired ordering, which is distinguished
> > > from
> > > > > > > total Ordering, and which is also what you get when you use the
> > > > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > > > >
> > > > > > > You can distinguish between -0.0 and +0.0 without using total
> > > > ordering
> > > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some
> > other
> > > > > > > ways to distinguish, I suspect, but that's the simplest one I
> > > recall
> > > > > > > at the moment.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

We could drop NaNs and require that -0 be normalised to +0 when writing out
stats. That would remove any degrees of freedom from the writer and then
straightforward comparison with =, <, >, >=, <=, != would work as expected.

On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> solution.
>
> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
> but I think it would be better to just ignore NaN-s for the pruposes of
> min/max stats. For reading, we can ignore stats that contain a NaN. We also
> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> problematic, though.
>
> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me: "This
> function is not required to be sensitive to the sign of zero, although some
> implementations additionally enforce that if one argument is +0 and the
> other is -0, then +0 is returned." [1
> <http://en.cppreference.com/w/c/numeric/math/fmax>]
>
> Br,
>
> Zoltan
>
>
>
> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:
>
> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > <bo...@cloudera.com> wrote:
> > > I would just like to mention that the fmax() / fmin() functions in
> C/C++
> > > Math library follow the aforementioned IEEE 754-2008 min and max
> > > specification:
> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >
> > > I think this behavior is also the most intuitive and useful regarding
> to
> > > statistics. If we want to select the max value, I think it's reasonable
> > to
> > > ignore nulls and not-numbers.
> >
> > It should be noted that this is different than the total ordering
> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> > < +0.0 < positive numbers < +inf < +NaN
> >
> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > 100% sure on that.
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

Today, Impala does not evaluate "<col> != <constant>" against stats, but as
Zoltan pointed out there is a way to reasonably do that. It does not work
if we ignore NaN though, so we need to be careful.

On Tue, Feb 20, 2018 at 9:24 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> In parquet-mr, if you are looking for a value that is not equal to some
> reference value r and stats are min = r and max = r then that row group is
> discarded, because there can not be any other values in that row group.
>
> On Tue, Feb 20, 2018 at 6:21 PM Jim Apple <jb...@cloudera.com> wrote:
>
> > For that predicate in particular, does Impala use stats already?
> >
> > Let's say a column contains only the intuitive notion of floats: no
> > NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
> > row-group stats are b <= $COL <= c, were a < b, we can know that the
> > whole row group can be included. The addition of NaNs doesn't change
> > that.
> >
> > OTOH, if b <= a <= c, then we have to check the whole row group, and
> > the addition of NaNs doesn't change that.
> >
> > On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <al...@cloudera.com>
> > wrote:
> > > On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com>
> wrote:
> > >
> > >> Hi,
> > >>
> > >> Tim, I added your suggestion to introduce a new ColumnOrder to
> > PARQUET-1222
> > >> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> > >> solution.
> > >>
> > >> Alex, not writing min/max if there is a NaN is indeed a feasible
> > quick-fix,
> > >> but I think it would be better to just ignore NaN-s for the pruposes
> of
> > >> min/max stats. For reading, we can ignore stats that contain a NaN. We
> > also
> > >> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> > >> problematic, though.
> > >>
> > >
> > > I don't think ignoring NaNs is correct. Consider a predicate <col> !=
> > > <constant> that would evaluate to true against NaN. We cannot reliable
> > use
> > > stats for such a predicate.
> > >
> > >
> > >>
> > >> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> > >> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me:
> > "This
> > >> function is not required to be sensitive to the sign of zero, although
> > some
> > >> implementations additionally enforce that if one argument is +0 and
> the
> > >> other is -0, then +0 is returned." [1
> > >> <http://en.cppreference.com/w/c/numeric/math/fmax>]
> > >>
> > >> Br,
> > >>
> > >> Zoltan
> > >>
> > >>
> > >>
> > >> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com>
> wrote:
> > >>
> > >> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > >> > <bo...@cloudera.com> wrote:
> > >> > > I would just like to mention that the fmax() / fmin() functions in
> > >> C/C++
> > >> > > Math library follow the aforementioned IEEE 754-2008 min and max
> > >> > > specification:
> > >> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >> > >
> > >> > > I think this behavior is also the most intuitive and useful
> > regarding
> > >> to
> > >> > > statistics. If we want to select the max value, I think it's
> > reasonable
> > >> > to
> > >> > > ignore nulls and not-numbers.
> > >> >
> > >> > It should be noted that this is different than the total ordering
> > >> > predicate. With that predicate, -NaN < -inf < negative numbers <
> -0.0
> > >> > < +0.0 < positive numbers < +inf < +NaN
> > >> >
> > >> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > >> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > >> > 100% sure on that.
> > >> >
> > >>
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

Today, Impala does not evaluate "<col> != <constant>" against stats, but as
Zoltan pointed out there is a way to reasonably do that. It does not work
if we ignore NaN though, so we need to be careful.

On Tue, Feb 20, 2018 at 9:24 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> In parquet-mr, if you are looking for a value that is not equal to some
> reference value r and stats are min = r and max = r then that row group is
> discarded, because there can not be any other values in that row group.
>
> On Tue, Feb 20, 2018 at 6:21 PM Jim Apple <jb...@cloudera.com> wrote:
>
> > For that predicate in particular, does Impala use stats already?
> >
> > Let's say a column contains only the intuitive notion of floats: no
> > NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
> > row-group stats are b <= $COL <= c, were a < b, we can know that the
> > whole row group can be included. The addition of NaNs doesn't change
> > that.
> >
> > OTOH, if b <= a <= c, then we have to check the whole row group, and
> > the addition of NaNs doesn't change that.
> >
> > On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <al...@cloudera.com>
> > wrote:
> > > On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com>
> wrote:
> > >
> > >> Hi,
> > >>
> > >> Tim, I added your suggestion to introduce a new ColumnOrder to
> > PARQUET-1222
> > >> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> > >> solution.
> > >>
> > >> Alex, not writing min/max if there is a NaN is indeed a feasible
> > quick-fix,
> > >> but I think it would be better to just ignore NaN-s for the pruposes
> of
> > >> min/max stats. For reading, we can ignore stats that contain a NaN. We
> > also
> > >> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> > >> problematic, though.
> > >>
> > >
> > > I don't think ignoring NaNs is correct. Consider a predicate <col> !=
> > > <constant> that would evaluate to true against NaN. We cannot reliable
> > use
> > > stats for such a predicate.
> > >
> > >
> > >>
> > >> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> > >> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me:
> > "This
> > >> function is not required to be sensitive to the sign of zero, although
> > some
> > >> implementations additionally enforce that if one argument is +0 and
> the
> > >> other is -0, then +0 is returned." [1
> > >> <http://en.cppreference.com/w/c/numeric/math/fmax>]
> > >>
> > >> Br,
> > >>
> > >> Zoltan
> > >>
> > >>
> > >>
> > >> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com>
> wrote:
> > >>
> > >> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > >> > <bo...@cloudera.com> wrote:
> > >> > > I would just like to mention that the fmax() / fmin() functions in
> > >> C/C++
> > >> > > Math library follow the aforementioned IEEE 754-2008 min and max
> > >> > > specification:
> > >> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >> > >
> > >> > > I think this behavior is also the most intuitive and useful
> > regarding
> > >> to
> > >> > > statistics. If we want to select the max value, I think it's
> > reasonable
> > >> > to
> > >> > > ignore nulls and not-numbers.
> > >> >
> > >> > It should be noted that this is different than the total ordering
> > >> > predicate. With that predicate, -NaN < -inf < negative numbers <
> -0.0
> > >> > < +0.0 < positive numbers < +inf < +NaN
> > >> >
> > >> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > >> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > >> > 100% sure on that.
> > >> >
> > >>
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Zoltan Ivanfi <zi...@cloudera.com>.

In parquet-mr, if you are looking for a value that is not equal to some
reference value r and stats are min = r and max = r then that row group is
discarded, because there can not be any other values in that row group.

On Tue, Feb 20, 2018 at 6:21 PM Jim Apple <jb...@cloudera.com> wrote:

> For that predicate in particular, does Impala use stats already?
>
> Let's say a column contains only the intuitive notion of floats: no
> NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
> row-group stats are b <= $COL <= c, were a < b, we can know that the
> whole row group can be included. The addition of NaNs doesn't change
> that.
>
> OTOH, if b <= a <= c, then we have to check the whole row group, and
> the addition of NaNs doesn't change that.
>
> On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <al...@cloudera.com>
> wrote:
> > On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >
> >> Hi,
> >>
> >> Tim, I added your suggestion to introduce a new ColumnOrder to
> PARQUET-1222
> >> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> >> solution.
> >>
> >> Alex, not writing min/max if there is a NaN is indeed a feasible
> quick-fix,
> >> but I think it would be better to just ignore NaN-s for the pruposes of
> >> min/max stats. For reading, we can ignore stats that contain a NaN. We
> also
> >> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> >> problematic, though.
> >>
> >
> > I don't think ignoring NaNs is correct. Consider a predicate <col> !=
> > <constant> that would evaluate to true against NaN. We cannot reliable
> use
> > stats for such a predicate.
> >
> >
> >>
> >> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> >> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me:
> "This
> >> function is not required to be sensitive to the sign of zero, although
> some
> >> implementations additionally enforce that if one argument is +0 and the
> >> other is -0, then +0 is returned." [1
> >> <http://en.cppreference.com/w/c/numeric/math/fmax>]
> >>
> >> Br,
> >>
> >> Zoltan
> >>
> >>
> >>
> >> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:
> >>
> >> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> >> > <bo...@cloudera.com> wrote:
> >> > > I would just like to mention that the fmax() / fmin() functions in
> >> C/C++
> >> > > Math library follow the aforementioned IEEE 754-2008 min and max
> >> > > specification:
> >> > > http://en.cppreference.com/w/c/numeric/math/fmax
> >> > >
> >> > > I think this behavior is also the most intuitive and useful
> regarding
> >> to
> >> > > statistics. If we want to select the max value, I think it's
> reasonable
> >> > to
> >> > > ignore nulls and not-numbers.
> >> >
> >> > It should be noted that this is different than the total ordering
> >> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> >> > < +0.0 < positive numbers < +inf < +NaN
> >> >
> >> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> >> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> >> > 100% sure on that.
> >> >
> >>
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Zoltan Ivanfi <zi...@cloudera.com>.

In parquet-mr, if you are looking for a value that is not equal to some
reference value r and stats are min = r and max = r then that row group is
discarded, because there can not be any other values in that row group.

On Tue, Feb 20, 2018 at 6:21 PM Jim Apple <jb...@cloudera.com> wrote:

> For that predicate in particular, does Impala use stats already?
>
> Let's say a column contains only the intuitive notion of floats: no
> NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
> row-group stats are b <= $COL <= c, were a < b, we can know that the
> whole row group can be included. The addition of NaNs doesn't change
> that.
>
> OTOH, if b <= a <= c, then we have to check the whole row group, and
> the addition of NaNs doesn't change that.
>
> On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <al...@cloudera.com>
> wrote:
> > On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >
> >> Hi,
> >>
> >> Tim, I added your suggestion to introduce a new ColumnOrder to
> PARQUET-1222
> >> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> >> solution.
> >>
> >> Alex, not writing min/max if there is a NaN is indeed a feasible
> quick-fix,
> >> but I think it would be better to just ignore NaN-s for the pruposes of
> >> min/max stats. For reading, we can ignore stats that contain a NaN. We
> also
> >> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> >> problematic, though.
> >>
> >
> > I don't think ignoring NaNs is correct. Consider a predicate <col> !=
> > <constant> that would evaluate to true against NaN. We cannot reliable
> use
> > stats for such a predicate.
> >
> >
> >>
> >> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> >> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me:
> "This
> >> function is not required to be sensitive to the sign of zero, although
> some
> >> implementations additionally enforce that if one argument is +0 and the
> >> other is -0, then +0 is returned." [1
> >> <http://en.cppreference.com/w/c/numeric/math/fmax>]
> >>
> >> Br,
> >>
> >> Zoltan
> >>
> >>
> >>
> >> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:
> >>
> >> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> >> > <bo...@cloudera.com> wrote:
> >> > > I would just like to mention that the fmax() / fmin() functions in
> >> C/C++
> >> > > Math library follow the aforementioned IEEE 754-2008 min and max
> >> > > specification:
> >> > > http://en.cppreference.com/w/c/numeric/math/fmax
> >> > >
> >> > > I think this behavior is also the most intuitive and useful
> regarding
> >> to
> >> > > statistics. If we want to select the max value, I think it's
> reasonable
> >> > to
> >> > > ignore nulls and not-numbers.
> >> >
> >> > It should be noted that this is different than the total ordering
> >> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> >> > < +0.0 < positive numbers < +inf < +NaN
> >> >
> >> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> >> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> >> > 100% sure on that.
> >> >
> >>
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Jim Apple <jb...@cloudera.com>.

For that predicate in particular, does Impala use stats already?

Let's say a column contains only the intuitive notion of floats: no
NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
row-group stats are b <= $COL <= c, were a < b, we can know that the
whole row group can be included. The addition of NaNs doesn't change
that.

OTOH, if b <= a <= c, then we have to check the whole row group, and
the addition of NaNs doesn't change that.

On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <al...@cloudera.com> wrote:
> On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>
>> Hi,
>>
>> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
>> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
>> solution.
>>
>> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
>> but I think it would be better to just ignore NaN-s for the pruposes of
>> min/max stats. For reading, we can ignore stats that contain a NaN. We also
>> shouldn't use stats when looking for a NaN. -0 and +0 will still be
>> problematic, though.
>>
>
> I don't think ignoring NaNs is correct. Consider a predicate <col> !=
> <constant> that would evaluate to true against NaN. We cannot reliable use
> stats for such a predicate.
>
>
>>
>> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
>> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me: "This
>> function is not required to be sensitive to the sign of zero, although some
>> implementations additionally enforce that if one argument is +0 and the
>> other is -0, then +0 is returned." [1
>> <http://en.cppreference.com/w/c/numeric/math/fmax>]
>>
>> Br,
>>
>> Zoltan
>>
>>
>>
>> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:
>>
>> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
>> > <bo...@cloudera.com> wrote:
>> > > I would just like to mention that the fmax() / fmin() functions in
>> C/C++
>> > > Math library follow the aforementioned IEEE 754-2008 min and max
>> > > specification:
>> > > http://en.cppreference.com/w/c/numeric/math/fmax
>> > >
>> > > I think this behavior is also the most intuitive and useful regarding
>> to
>> > > statistics. If we want to select the max value, I think it's reasonable
>> > to
>> > > ignore nulls and not-numbers.
>> >
>> > It should be noted that this is different than the total ordering
>> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
>> > < +0.0 < positive numbers < +inf < +NaN
>> >
>> > fmax appears to be closest to IEEE-754's maxNum, but not quite
>> > matching for some corner cases (-0.0, signalling NaN), but I'm not
>> > 100% sure on that.
>> >
>>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Jim Apple <jb...@cloudera.com>.

For that predicate in particular, does Impala use stats already?

Let's say a column contains only the intuitive notion of floats: no
NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
row-group stats are b <= $COL <= c, were a < b, we can know that the
whole row group can be included. The addition of NaNs doesn't change
that.

OTOH, if b <= a <= c, then we have to check the whole row group, and
the addition of NaNs doesn't change that.

On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <al...@cloudera.com> wrote:
> On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>
>> Hi,
>>
>> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
>> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
>> solution.
>>
>> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
>> but I think it would be better to just ignore NaN-s for the pruposes of
>> min/max stats. For reading, we can ignore stats that contain a NaN. We also
>> shouldn't use stats when looking for a NaN. -0 and +0 will still be
>> problematic, though.
>>
>
> I don't think ignoring NaNs is correct. Consider a predicate <col> !=
> <constant> that would evaluate to true against NaN. We cannot reliable use
> stats for such a predicate.
>
>
>>
>> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
>> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me: "This
>> function is not required to be sensitive to the sign of zero, although some
>> implementations additionally enforce that if one argument is +0 and the
>> other is -0, then +0 is returned." [1
>> <http://en.cppreference.com/w/c/numeric/math/fmax>]
>>
>> Br,
>>
>> Zoltan
>>
>>
>>
>> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:
>>
>> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
>> > <bo...@cloudera.com> wrote:
>> > > I would just like to mention that the fmax() / fmin() functions in
>> C/C++
>> > > Math library follow the aforementioned IEEE 754-2008 min and max
>> > > specification:
>> > > http://en.cppreference.com/w/c/numeric/math/fmax
>> > >
>> > > I think this behavior is also the most intuitive and useful regarding
>> to
>> > > statistics. If we want to select the max value, I think it's reasonable
>> > to
>> > > ignore nulls and not-numbers.
>> >
>> > It should be noted that this is different than the total ordering
>> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
>> > < +0.0 < positive numbers < +inf < +NaN
>> >
>> > fmax appears to be closest to IEEE-754's maxNum, but not quite
>> > matching for some corner cases (-0.0, signalling NaN), but I'm not
>> > 100% sure on that.
>> >
>>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> solution.
>
> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
> but I think it would be better to just ignore NaN-s for the pruposes of
> min/max stats. For reading, we can ignore stats that contain a NaN. We also
> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> problematic, though.
>

I don't think ignoring NaNs is correct. Consider a predicate <col> !=
<constant> that would evaluate to true against NaN. We cannot reliable use
stats for such a predicate.


>
> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me: "This
> function is not required to be sensitive to the sign of zero, although some
> implementations additionally enforce that if one argument is +0 and the
> other is -0, then +0 is returned." [1
> <http://en.cppreference.com/w/c/numeric/math/fmax>]
>
> Br,
>
> Zoltan
>
>
>
> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:
>
> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > <bo...@cloudera.com> wrote:
> > > I would just like to mention that the fmax() / fmin() functions in
> C/C++
> > > Math library follow the aforementioned IEEE 754-2008 min and max
> > > specification:
> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >
> > > I think this behavior is also the most intuitive and useful regarding
> to
> > > statistics. If we want to select the max value, I think it's reasonable
> > to
> > > ignore nulls and not-numbers.
> >
> > It should be noted that this is different than the total ordering
> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> > < +0.0 < positive numbers < +inf < +NaN
> >
> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > 100% sure on that.
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

We could drop NaNs and require that -0 be normalised to +0 when writing out
stats. That would remove any degrees of freedom from the writer and then
straightforward comparison with =, <, >, >=, <=, != would work as expected.

On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> solution.
>
> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
> but I think it would be better to just ignore NaN-s for the pruposes of
> min/max stats. For reading, we can ignore stats that contain a NaN. We also
> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> problematic, though.
>
> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me: "This
> function is not required to be sensitive to the sign of zero, although some
> implementations additionally enforce that if one argument is +0 and the
> other is -0, then +0 is returned." [1
> <http://en.cppreference.com/w/c/numeric/math/fmax>]
>
> Br,
>
> Zoltan
>
>
>
> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:
>
> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > <bo...@cloudera.com> wrote:
> > > I would just like to mention that the fmax() / fmin() functions in
> C/C++
> > > Math library follow the aforementioned IEEE 754-2008 min and max
> > > specification:
> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >
> > > I think this behavior is also the most intuitive and useful regarding
> to
> > > statistics. If we want to select the max value, I think it's reasonable
> > to
> > > ignore nulls and not-numbers.
> >
> > It should be noted that this is different than the total ordering
> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> > < +0.0 < positive numbers < +inf < +NaN
> >
> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > 100% sure on that.
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
> solution.
>
> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
> but I think it would be better to just ignore NaN-s for the pruposes of
> min/max stats. For reading, we can ignore stats that contain a NaN. We also
> shouldn't use stats when looking for a NaN. -0 and +0 will still be
> problematic, though.
>

I don't think ignoring NaNs is correct. Consider a predicate <col> !=
<constant> that would evaluate to true against NaN. We cannot reliable use
stats for such a predicate.


>
> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
> implementation-dependent, az Zoltan Borok-Nagy pointed it out to me: "This
> function is not required to be sensitive to the sign of zero, although some
> implementations additionally enforce that if one argument is +0 and the
> other is -0, then +0 is returned." [1
> <http://en.cppreference.com/w/c/numeric/math/fmax>]
>
> Br,
>
> Zoltan
>
>
>
> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:
>
> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> > <bo...@cloudera.com> wrote:
> > > I would just like to mention that the fmax() / fmin() functions in
> C/C++
> > > Math library follow the aforementioned IEEE 754-2008 min and max
> > > specification:
> > > http://en.cppreference.com/w/c/numeric/math/fmax
> > >
> > > I think this behavior is also the most intuitive and useful regarding
> to
> > > statistics. If we want to select the max value, I think it's reasonable
> > to
> > > ignore nulls and not-numbers.
> >
> > It should be noted that this is different than the total ordering
> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> > < +0.0 < positive numbers < +inf < +NaN
> >
> > fmax appears to be closest to IEEE-754's maxNum, but not quite
> > matching for some corner cases (-0.0, signalling NaN), but I'm not
> > 100% sure on that.
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Zoltan Ivanfi <zi...@cloudera.com>.

Hi,

Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
<https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
solution.

Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
but I think it would be better to just ignore NaN-s for the pruposes of
min/max stats. For reading, we can ignore stats that contain a NaN. We also
shouldn't use stats when looking for a NaN. -0 and +0 will still be
problematic, though.

Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
implementation-dependent, az Zoltan Borok-Nagy pointed it out to me: "This
function is not required to be sensitive to the sign of zero, although some
implementations additionally enforce that if one argument is +0 and the
other is -0, then +0 is returned." [1
<http://en.cppreference.com/w/c/numeric/math/fmax>]

Br,

Zoltan

On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:

> On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> <bo...@cloudera.com> wrote:
> > I would just like to mention that the fmax() / fmin() functions in C/C++
> > Math library follow the aforementioned IEEE 754-2008 min and max
> > specification:
> > http://en.cppreference.com/w/c/numeric/math/fmax
> >
> > I think this behavior is also the most intuitive and useful regarding to
> > statistics. If we want to select the max value, I think it's reasonable
> to
> > ignore nulls and not-numbers.
>
> It should be noted that this is different than the total ordering
> predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> < +0.0 < positive numbers < +inf < +NaN
>
> fmax appears to be closest to IEEE-754's maxNum, but not quite
> matching for some corner cases (-0.0, signalling NaN), but I'm not
> 100% sure on that.
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Zoltan Ivanfi <zi...@cloudera.com>.

Hi,

Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
<https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
solution.

Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
but I think it would be better to just ignore NaN-s for the pruposes of
min/max stats. For reading, we can ignore stats that contain a NaN. We also
shouldn't use stats when looking for a NaN. -0 and +0 will still be
problematic, though.

Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
implementation-dependent, az Zoltan Borok-Nagy pointed it out to me: "This
function is not required to be sensitive to the sign of zero, although some
implementations additionally enforce that if one argument is +0 and the
other is -0, then +0 is returned." [1
<http://en.cppreference.com/w/c/numeric/math/fmax>]

Br,

Zoltan

On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jb...@cloudera.com> wrote:

> On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
> <bo...@cloudera.com> wrote:
> > I would just like to mention that the fmax() / fmin() functions in C/C++
> > Math library follow the aforementioned IEEE 754-2008 min and max
> > specification:
> > http://en.cppreference.com/w/c/numeric/math/fmax
> >
> > I think this behavior is also the most intuitive and useful regarding to
> > statistics. If we want to select the max value, I think it's reasonable
> to
> > ignore nulls and not-numbers.
>
> It should be noted that this is different than the total ordering
> predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
> < +0.0 < positive numbers < +inf < +NaN
>
> fmax appears to be closest to IEEE-754's maxNum, but not quite
> matching for some corner cases (-0.0, signalling NaN), but I'm not
> 100% sure on that.
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Jim Apple <jb...@cloudera.com>.

On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
<bo...@cloudera.com> wrote:
> I would just like to mention that the fmax() / fmin() functions in C/C++
> Math library follow the aforementioned IEEE 754-2008 min and max
> specification:
> http://en.cppreference.com/w/c/numeric/math/fmax
>
> I think this behavior is also the most intuitive and useful regarding to
> statistics. If we want to select the max value, I think it's reasonable to
> ignore nulls and not-numbers.

It should be noted that this is different than the total ordering
predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
< +0.0 < positive numbers < +inf < +NaN

fmax appears to be closest to IEEE-754's maxNum, but not quite
matching for some corner cases (-0.0, signalling NaN), but I'm not
100% sure on that.

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Jim Apple <jb...@cloudera.com>.

On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
<bo...@cloudera.com> wrote:
> I would just like to mention that the fmax() / fmin() functions in C/C++
> Math library follow the aforementioned IEEE 754-2008 min and max
> specification:
> http://en.cppreference.com/w/c/numeric/math/fmax
>
> I think this behavior is also the most intuitive and useful regarding to
> statistics. If we want to select the max value, I think it's reasonable to
> ignore nulls and not-numbers.

It should be noted that this is different than the total ordering
predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
< +0.0 < positive numbers < +inf < +NaN

fmax appears to be closest to IEEE-754's maxNum, but not quite
matching for some corner cases (-0.0, signalling NaN), but I'm not
100% sure on that.

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Zoltan Borok-Nagy <bo...@cloudera.com>.

I would just like to mention that the fmax() / fmin() functions in C/C++
Math library follow the aforementioned IEEE 754-2008 min and max
specification:
http://en.cppreference.com/w/c/numeric/math/fmax

I think this behavior is also the most intuitive and useful regarding to
statistics. If we want to select the max value, I think it's reasonable to
ignore nulls and not-numbers.
I think it serves well the common case as well. E.g. there are a lot of
numbers, and some NaNs, I don't know if we want to ruin our upper bound by
choosing NaN as max.

And if there's still a NaN by an old writer, we can treat is as Inf.

Just my two cents.


On Fri, Feb 16, 2018 at 6:38 PM, Tim Armstrong <ta...@cloudera.com>
wrote:

> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that they are reading a file written by
> a writer that drops stats when it sees NaNs.
>
> It *could* fix the behaviour of some naive readers that don't correctly
> handle the current ambiguity in the specification, but I think those need
> to be fixed anyway because they will return wrong results for existing
> files.
>
> In the process of fixing the readers, you could then modify the readers so
> that they are aware of this special writer that drops stats with NaNs and
> knows that it is safe to use them, but I think those kind of shared
> reader-writer assumptions are essentially like having an unofficial
> extension of the Parquet spec.
>
> On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <lv...@cloudera.com> wrote:
>
> > Yeah, I missed that. We set it per column, so all other types could keep
> > TypeDefinedOrder and floats could have something like
> NanAwareDoubleOrder.
> >
> > On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <ta...@cloudera.com>
> > wrote:
> >
> > > We wouldn't need to rev the whole TypeDefinedOrder thing right?
> Couldn't
> > we
> > > just define a special order for floats? Essentially it would be a tag
> for
> > > writers to say "hey I know about this total order thing".
> > >
> > > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:
> > >
> > > > I think one idea behind the column order fields was that if a reader
> > does
> > > > not recognize a value there, it needs to ignore the stats. If I
> > remember
> > > > correctly, that was intended to allow us to add new orderings for
> > > > collations, but it also seems useful to address gaps in the spec or
> > known
> > > > broken readers. In this case we would need to deprecate the default
> > > > "TypeDefinedOrder" and replace it with something like
> > > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> > up,
> > > > like TypeDefinedOrderV2 and so on.
> > > >
> > > > An alternative would be to list all writers that are known to have
> > > written
> > > > incorrect stats. However that will not prevent old implementations to
> > > > misinterpret correct stats - which I think was the main reason why we
> > > added
> > > > new stats fields.
> > > >
> > > >
> > > >
> > > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> > alex.behm@cloudera.com>
> > > > wrote:
> > > >
> > > > > I hope the common cases is that data files do not contain these
> > special
> > > > > float values. As the simplest solution, how about writers refrain
> > from
> > > > > populating the stats if a special value is encountered?
> > > > >
> > > > > That fix does not preclude a more thorough solution in the future,
> > but
> > > it
> > > > > addresses the common case quickly.
> > > > >
> > > > > For existing data files we could check the writer version ignore
> > > filters
> > > > on
> > > > > float/double. I don't know whether min/max filtering is common on
> > > > > float/double, but I suspect it's not.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > > tarmstrong@cloudera.com>
> > > > > wrote:
> > > > >
> > > > > > There is an extensibility mechanism with the ColumnOrder union -
> I
> > > > think
> > > > > > that was meant to avoid the need to add new stat fields?
> > > > > >
> > > > > > Given that the bug was in the Parquet spec, we'll need to make a
> > spec
> > > > > > change anyway, so we could add a new ColumnOrder -
> > > > > FloatingPointTotalOrder?
> > > > > > at the same time as fixing the gap in the spec.
> > > > > >
> > > > > > It could make sense to declare that the default ordering for
> > > > > floats/doubles
> > > > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > > > arbitrarily
> > > > > > ordered) and readers should either implement the required logic
> to
> > > > handle
> > > > > > that correctly (I had some ideas here:
> > > > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-
> 16366106)
> > > > > > or ignore the stats.
> > > > > >
> > > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbapple@cloudera.com
> >
> > > > wrote:
> > > > > >
> > > > > > > > We could have a similar problem
> > > > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > > > max_value
> > > > > > > > field by some component that considers them the same.
> > > > > > >
> > > > > > > My hope is that the filtering would behave sanely, since -0.0
> ==
> > > +0.0
> > > > > > > under the real-number-inspired ordering, which is distinguished
> > > from
> > > > > > > total Ordering, and which is also what you get when you use the
> > > > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > > > >
> > > > > > > You can distinguish between -0.0 and +0.0 without using total
> > > > ordering
> > > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some
> > other
> > > > > > > ways to distinguish, I suspect, but that's the simplest one I
> > > recall
> > > > > > > at the moment.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

On Fri, Feb 16, 2018 at 9:38 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that they are reading a file written by
> a writer that drops stats when it sees NaNs.
>
> It *could* fix the behaviour of some naive readers that don't correctly
> handle the current ambiguity in the specification, but I think those need
> to be fixed anyway because they will return wrong results for existing
> files.
>
> In the process of fixing the readers, you could then modify the readers so
> that they are aware of this special writer that drops stats with NaNs and
> knows that it is safe to use them, but I think those kind of shared
> reader-writer assumptions are essentially like having an unofficial
> extension of the Parquet spec.
>

Good point. I agree that summarizes the issues.

It is basically treating this issue like a bug in the writer (arguable). I
don't think we should require a rev in the Parquet spec to address bugs in
writers.
Working around bugs will always require checking the writer version in all
readers.

I'm certainly on board with the cleaner solution to adjust the spec - but
writers will always have bugs.

>
> On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <lv...@cloudera.com> wrote:
>
> > Yeah, I missed that. We set it per column, so all other types could keep
> > TypeDefinedOrder and floats could have something like
> NanAwareDoubleOrder.
> >
> > On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <ta...@cloudera.com>
> > wrote:
> >
> > > We wouldn't need to rev the whole TypeDefinedOrder thing right?
> Couldn't
> > we
> > > just define a special order for floats? Essentially it would be a tag
> for
> > > writers to say "hey I know about this total order thing".
> > >
> > > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:
> > >
> > > > I think one idea behind the column order fields was that if a reader
> > does
> > > > not recognize a value there, it needs to ignore the stats. If I
> > remember
> > > > correctly, that was intended to allow us to add new orderings for
> > > > collations, but it also seems useful to address gaps in the spec or
> > known
> > > > broken readers. In this case we would need to deprecate the default
> > > > "TypeDefinedOrder" and replace it with something like
> > > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> > up,
> > > > like TypeDefinedOrderV2 and so on.
> > > >
> > > > An alternative would be to list all writers that are known to have
> > > written
> > > > incorrect stats. However that will not prevent old implementations to
> > > > misinterpret correct stats - which I think was the main reason why we
> > > added
> > > > new stats fields.
> > > >
> > > >
> > > >
> > > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> > alex.behm@cloudera.com>
> > > > wrote:
> > > >
> > > > > I hope the common cases is that data files do not contain these
> > special
> > > > > float values. As the simplest solution, how about writers refrain
> > from
> > > > > populating the stats if a special value is encountered?
> > > > >
> > > > > That fix does not preclude a more thorough solution in the future,
> > but
> > > it
> > > > > addresses the common case quickly.
> > > > >
> > > > > For existing data files we could check the writer version ignore
> > > filters
> > > > on
> > > > > float/double. I don't know whether min/max filtering is common on
> > > > > float/double, but I suspect it's not.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > > tarmstrong@cloudera.com>
> > > > > wrote:
> > > > >
> > > > > > There is an extensibility mechanism with the ColumnOrder union -
> I
> > > > think
> > > > > > that was meant to avoid the need to add new stat fields?
> > > > > >
> > > > > > Given that the bug was in the Parquet spec, we'll need to make a
> > spec
> > > > > > change anyway, so we could add a new ColumnOrder -
> > > > > FloatingPointTotalOrder?
> > > > > > at the same time as fixing the gap in the spec.
> > > > > >
> > > > > > It could make sense to declare that the default ordering for
> > > > > floats/doubles
> > > > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > > > arbitrarily
> > > > > > ordered) and readers should either implement the required logic
> to
> > > > handle
> > > > > > that correctly (I had some ideas here:
> > > > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-
> 16366106)
> > > > > > or ignore the stats.
> > > > > >
> > > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbapple@cloudera.com
> >
> > > > wrote:
> > > > > >
> > > > > > > > We could have a similar problem
> > > > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > > > max_value
> > > > > > > > field by some component that considers them the same.
> > > > > > >
> > > > > > > My hope is that the filtering would behave sanely, since -0.0
> ==
> > > +0.0
> > > > > > > under the real-number-inspired ordering, which is distinguished
> > > from
> > > > > > > total Ordering, and which is also what you get when you use the
> > > > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > > > >
> > > > > > > You can distinguish between -0.0 and +0.0 without using total
> > > > ordering
> > > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some
> > other
> > > > > > > ways to distinguish, I suspect, but that's the simplest one I
> > > recall
> > > > > > > at the moment.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Zoltan Borok-Nagy <bo...@cloudera.com>.

I would just like to mention that the fmax() / fmin() functions in C/C++
Math library follow the aforementioned IEEE 754-2008 min and max
specification:
http://en.cppreference.com/w/c/numeric/math/fmax

I think this behavior is also the most intuitive and useful regarding to
statistics. If we want to select the max value, I think it's reasonable to
ignore nulls and not-numbers.
I think it serves well the common case as well. E.g. there are a lot of
numbers, and some NaNs, I don't know if we want to ruin our upper bound by
choosing NaN as max.

And if there's still a NaN by an old writer, we can treat is as Inf.

Just my two cents.


On Fri, Feb 16, 2018 at 6:38 PM, Tim Armstrong <ta...@cloudera.com>
wrote:

> The reader still can't correctly interpret those stats without knowing
> about the behaviour of that specific writer though, because it can't assume
> the absence of NaNs unless it knows that they are reading a file written by
> a writer that drops stats when it sees NaNs.
>
> It *could* fix the behaviour of some naive readers that don't correctly
> handle the current ambiguity in the specification, but I think those need
> to be fixed anyway because they will return wrong results for existing
> files.
>
> In the process of fixing the readers, you could then modify the readers so
> that they are aware of this special writer that drops stats with NaNs and
> knows that it is safe to use them, but I think those kind of shared
> reader-writer assumptions are essentially like having an unofficial
> extension of the Parquet spec.
>
> On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <lv...@cloudera.com> wrote:
>
> > Yeah, I missed that. We set it per column, so all other types could keep
> > TypeDefinedOrder and floats could have something like
> NanAwareDoubleOrder.
> >
> > On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <ta...@cloudera.com>
> > wrote:
> >
> > > We wouldn't need to rev the whole TypeDefinedOrder thing right?
> Couldn't
> > we
> > > just define a special order for floats? Essentially it would be a tag
> for
> > > writers to say "hey I know about this total order thing".
> > >
> > > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:
> > >
> > > > I think one idea behind the column order fields was that if a reader
> > does
> > > > not recognize a value there, it needs to ignore the stats. If I
> > remember
> > > > correctly, that was intended to allow us to add new orderings for
> > > > collations, but it also seems useful to address gaps in the spec or
> > known
> > > > broken readers. In this case we would need to deprecate the default
> > > > "TypeDefinedOrder" and replace it with something like
> > > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> > up,
> > > > like TypeDefinedOrderV2 and so on.
> > > >
> > > > An alternative would be to list all writers that are known to have
> > > written
> > > > incorrect stats. However that will not prevent old implementations to
> > > > misinterpret correct stats - which I think was the main reason why we
> > > added
> > > > new stats fields.
> > > >
> > > >
> > > >
> > > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> > alex.behm@cloudera.com>
> > > > wrote:
> > > >
> > > > > I hope the common cases is that data files do not contain these
> > special
> > > > > float values. As the simplest solution, how about writers refrain
> > from
> > > > > populating the stats if a special value is encountered?
> > > > >
> > > > > That fix does not preclude a more thorough solution in the future,
> > but
> > > it
> > > > > addresses the common case quickly.
> > > > >
> > > > > For existing data files we could check the writer version ignore
> > > filters
> > > > on
> > > > > float/double. I don't know whether min/max filtering is common on
> > > > > float/double, but I suspect it's not.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > > tarmstrong@cloudera.com>
> > > > > wrote:
> > > > >
> > > > > > There is an extensibility mechanism with the ColumnOrder union -
> I
> > > > think
> > > > > > that was meant to avoid the need to add new stat fields?
> > > > > >
> > > > > > Given that the bug was in the Parquet spec, we'll need to make a
> > spec
> > > > > > change anyway, so we could add a new ColumnOrder -
> > > > > FloatingPointTotalOrder?
> > > > > > at the same time as fixing the gap in the spec.
> > > > > >
> > > > > > It could make sense to declare that the default ordering for
> > > > > floats/doubles
> > > > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > > > arbitrarily
> > > > > > ordered) and readers should either implement the required logic
> to
> > > > handle
> > > > > > that correctly (I had some ideas here:
> > > > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-
> 16366106)
> > > > > > or ignore the stats.
> > > > > >
> > > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbapple@cloudera.com
> >
> > > > wrote:
> > > > > >
> > > > > > > > We could have a similar problem
> > > > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > > > max_value
> > > > > > > > field by some component that considers them the same.
> > > > > > >
> > > > > > > My hope is that the filtering would behave sanely, since -0.0
> ==
> > > +0.0
> > > > > > > under the real-number-inspired ordering, which is distinguished
> > > from
> > > > > > > total Ordering, and which is also what you get when you use the
> > > > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > > > >
> > > > > > > You can distinguish between -0.0 and +0.0 without using total
> > > > ordering
> > > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some
> > other
> > > > > > > ways to distinguish, I suspect, but that's the simplest one I
> > > recall
> > > > > > > at the moment.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

The reader still can't correctly interpret those stats without knowing
about the behaviour of that specific writer though, because it can't assume
the absence of NaNs unless it knows that they are reading a file written by
a writer that drops stats when it sees NaNs.

It *could* fix the behaviour of some naive readers that don't correctly
handle the current ambiguity in the specification, but I think those need
to be fixed anyway because they will return wrong results for existing
files.

In the process of fixing the readers, you could then modify the readers so
that they are aware of this special writer that drops stats with NaNs and
knows that it is safe to use them, but I think those kind of shared
reader-writer assumptions are essentially like having an unofficial
extension of the Parquet spec.

On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <lv...@cloudera.com> wrote:

> Yeah, I missed that. We set it per column, so all other types could keep
> TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.
>
> On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <ta...@cloudera.com>
> wrote:
>
> > We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't
> we
> > just define a special order for floats? Essentially it would be a tag for
> > writers to say "hey I know about this total order thing".
> >
> > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:
> >
> > > I think one idea behind the column order fields was that if a reader
> does
> > > not recognize a value there, it needs to ignore the stats. If I
> remember
> > > correctly, that was intended to allow us to add new orderings for
> > > collations, but it also seems useful to address gaps in the spec or
> known
> > > broken readers. In this case we would need to deprecate the default
> > > "TypeDefinedOrder" and replace it with something like
> > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> up,
> > > like TypeDefinedOrderV2 and so on.
> > >
> > > An alternative would be to list all writers that are known to have
> > written
> > > incorrect stats. However that will not prevent old implementations to
> > > misinterpret correct stats - which I think was the main reason why we
> > added
> > > new stats fields.
> > >
> > >
> > >
> > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> alex.behm@cloudera.com>
> > > wrote:
> > >
> > > > I hope the common cases is that data files do not contain these
> special
> > > > float values. As the simplest solution, how about writers refrain
> from
> > > > populating the stats if a special value is encountered?
> > > >
> > > > That fix does not preclude a more thorough solution in the future,
> but
> > it
> > > > addresses the common case quickly.
> > > >
> > > > For existing data files we could check the writer version ignore
> > filters
> > > on
> > > > float/double. I don't know whether min/max filtering is common on
> > > > float/double, but I suspect it's not.
> > > >
> > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > tarmstrong@cloudera.com>
> > > > wrote:
> > > >
> > > > > There is an extensibility mechanism with the ColumnOrder union - I
> > > think
> > > > > that was meant to avoid the need to add new stat fields?
> > > > >
> > > > > Given that the bug was in the Parquet spec, we'll need to make a
> spec
> > > > > change anyway, so we could add a new ColumnOrder -
> > > > FloatingPointTotalOrder?
> > > > > at the same time as fixing the gap in the spec.
> > > > >
> > > > > It could make sense to declare that the default ordering for
> > > > floats/doubles
> > > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > > arbitrarily
> > > > > ordered) and readers should either implement the required logic to
> > > handle
> > > > > that correctly (I had some ideas here:
> > > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > > > or ignore the stats.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com>
> > > wrote:
> > > > >
> > > > > > > We could have a similar problem
> > > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > > max_value
> > > > > > > field by some component that considers them the same.
> > > > > >
> > > > > > My hope is that the filtering would behave sanely, since -0.0 ==
> > +0.0
> > > > > > under the real-number-inspired ordering, which is distinguished
> > from
> > > > > > total Ordering, and which is also what you get when you use the
> > > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > > >
> > > > > > You can distinguish between -0.0 and +0.0 without using total
> > > ordering
> > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some
> other
> > > > > > ways to distinguish, I suspect, but that's the simplest one I
> > recall
> > > > > > at the moment.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

The reader still can't correctly interpret those stats without knowing
about the behaviour of that specific writer though, because it can't assume
the absence of NaNs unless it knows that they are reading a file written by
a writer that drops stats when it sees NaNs.

It *could* fix the behaviour of some naive readers that don't correctly
handle the current ambiguity in the specification, but I think those need
to be fixed anyway because they will return wrong results for existing
files.

In the process of fixing the readers, you could then modify the readers so
that they are aware of this special writer that drops stats with NaNs and
knows that it is safe to use them, but I think those kind of shared
reader-writer assumptions are essentially like having an unofficial
extension of the Parquet spec.

On Fri, Feb 16, 2018 at 9:20 AM, Lars Volker <lv...@cloudera.com> wrote:

> Yeah, I missed that. We set it per column, so all other types could keep
> TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.
>
> On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <ta...@cloudera.com>
> wrote:
>
> > We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't
> we
> > just define a special order for floats? Essentially it would be a tag for
> > writers to say "hey I know about this total order thing".
> >
> > On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:
> >
> > > I think one idea behind the column order fields was that if a reader
> does
> > > not recognize a value there, it needs to ignore the stats. If I
> remember
> > > correctly, that was intended to allow us to add new orderings for
> > > collations, but it also seems useful to address gaps in the spec or
> known
> > > broken readers. In this case we would need to deprecate the default
> > > "TypeDefinedOrder" and replace it with something like
> > > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count
> up,
> > > like TypeDefinedOrderV2 and so on.
> > >
> > > An alternative would be to list all writers that are known to have
> > written
> > > incorrect stats. However that will not prevent old implementations to
> > > misinterpret correct stats - which I think was the main reason why we
> > added
> > > new stats fields.
> > >
> > >
> > >
> > > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <
> alex.behm@cloudera.com>
> > > wrote:
> > >
> > > > I hope the common cases is that data files do not contain these
> special
> > > > float values. As the simplest solution, how about writers refrain
> from
> > > > populating the stats if a special value is encountered?
> > > >
> > > > That fix does not preclude a more thorough solution in the future,
> but
> > it
> > > > addresses the common case quickly.
> > > >
> > > > For existing data files we could check the writer version ignore
> > filters
> > > on
> > > > float/double. I don't know whether min/max filtering is common on
> > > > float/double, but I suspect it's not.
> > > >
> > > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> > tarmstrong@cloudera.com>
> > > > wrote:
> > > >
> > > > > There is an extensibility mechanism with the ColumnOrder union - I
> > > think
> > > > > that was meant to avoid the need to add new stat fields?
> > > > >
> > > > > Given that the bug was in the Parquet spec, we'll need to make a
> spec
> > > > > change anyway, so we could add a new ColumnOrder -
> > > > FloatingPointTotalOrder?
> > > > > at the same time as fixing the gap in the spec.
> > > > >
> > > > > It could make sense to declare that the default ordering for
> > > > floats/doubles
> > > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > > arbitrarily
> > > > > ordered) and readers should either implement the required logic to
> > > handle
> > > > > that correctly (I had some ideas here:
> > > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > > > or ignore the stats.
> > > > >
> > > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com>
> > > wrote:
> > > > >
> > > > > > > We could have a similar problem
> > > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > > max_value
> > > > > > > field by some component that considers them the same.
> > > > > >
> > > > > > My hope is that the filtering would behave sanely, since -0.0 ==
> > +0.0
> > > > > > under the real-number-inspired ordering, which is distinguished
> > from
> > > > > > total Ordering, and which is also what you get when you use the
> > > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > > >
> > > > > > You can distinguish between -0.0 and +0.0 without using total
> > > ordering
> > > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some
> other
> > > > > > ways to distinguish, I suspect, but that's the simplest one I
> > recall
> > > > > > at the moment.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Lars Volker <lv...@cloudera.com>.

Yeah, I missed that. We set it per column, so all other types could keep
TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.

On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
> just define a special order for floats? Essentially it would be a tag for
> writers to say "hey I know about this total order thing".
>
> On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:
>
> > I think one idea behind the column order fields was that if a reader does
> > not recognize a value there, it needs to ignore the stats. If I remember
> > correctly, that was intended to allow us to add new orderings for
> > collations, but it also seems useful to address gaps in the spec or known
> > broken readers. In this case we would need to deprecate the default
> > "TypeDefinedOrder" and replace it with something like
> > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up,
> > like TypeDefinedOrderV2 and so on.
> >
> > An alternative would be to list all writers that are known to have
> written
> > incorrect stats. However that will not prevent old implementations to
> > misinterpret correct stats - which I think was the main reason why we
> added
> > new stats fields.
> >
> >
> >
> > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
> > wrote:
> >
> > > I hope the common cases is that data files do not contain these special
> > > float values. As the simplest solution, how about writers refrain from
> > > populating the stats if a special value is encountered?
> > >
> > > That fix does not preclude a more thorough solution in the future, but
> it
> > > addresses the common case quickly.
> > >
> > > For existing data files we could check the writer version ignore
> filters
> > on
> > > float/double. I don't know whether min/max filtering is common on
> > > float/double, but I suspect it's not.
> > >
> > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> tarmstrong@cloudera.com>
> > > wrote:
> > >
> > > > There is an extensibility mechanism with the ColumnOrder union - I
> > think
> > > > that was meant to avoid the need to add new stat fields?
> > > >
> > > > Given that the bug was in the Parquet spec, we'll need to make a spec
> > > > change anyway, so we could add a new ColumnOrder -
> > > FloatingPointTotalOrder?
> > > > at the same time as fixing the gap in the spec.
> > > >
> > > > It could make sense to declare that the default ordering for
> > > floats/doubles
> > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > arbitrarily
> > > > ordered) and readers should either implement the required logic to
> > handle
> > > > that correctly (I had some ideas here:
> > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > > or ignore the stats.
> > > >
> > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com>
> > wrote:
> > > >
> > > > > > We could have a similar problem
> > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > max_value
> > > > > > field by some component that considers them the same.
> > > > >
> > > > > My hope is that the filtering would behave sanely, since -0.0 ==
> +0.0
> > > > > under the real-number-inspired ordering, which is distinguished
> from
> > > > > total Ordering, and which is also what you get when you use the
> > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > >
> > > > > You can distinguish between -0.0 and +0.0 without using total
> > ordering
> > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > > > ways to distinguish, I suspect, but that's the simplest one I
> recall
> > > > > at the moment.
> > > > >
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Lars Volker <lv...@cloudera.com>.

Yeah, I missed that. We set it per column, so all other types could keep
TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.

On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
> just define a special order for floats? Essentially it would be a tag for
> writers to say "hey I know about this total order thing".
>
> On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:
>
> > I think one idea behind the column order fields was that if a reader does
> > not recognize a value there, it needs to ignore the stats. If I remember
> > correctly, that was intended to allow us to add new orderings for
> > collations, but it also seems useful to address gaps in the spec or known
> > broken readers. In this case we would need to deprecate the default
> > "TypeDefinedOrder" and replace it with something like
> > "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up,
> > like TypeDefinedOrderV2 and so on.
> >
> > An alternative would be to list all writers that are known to have
> written
> > incorrect stats. However that will not prevent old implementations to
> > misinterpret correct stats - which I think was the main reason why we
> added
> > new stats fields.
> >
> >
> >
> > On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
> > wrote:
> >
> > > I hope the common cases is that data files do not contain these special
> > > float values. As the simplest solution, how about writers refrain from
> > > populating the stats if a special value is encountered?
> > >
> > > That fix does not preclude a more thorough solution in the future, but
> it
> > > addresses the common case quickly.
> > >
> > > For existing data files we could check the writer version ignore
> filters
> > on
> > > float/double. I don't know whether min/max filtering is common on
> > > float/double, but I suspect it's not.
> > >
> > > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <
> tarmstrong@cloudera.com>
> > > wrote:
> > >
> > > > There is an extensibility mechanism with the ColumnOrder union - I
> > think
> > > > that was meant to avoid the need to add new stat fields?
> > > >
> > > > Given that the bug was in the Parquet spec, we'll need to make a spec
> > > > change anyway, so we could add a new ColumnOrder -
> > > FloatingPointTotalOrder?
> > > > at the same time as fixing the gap in the spec.
> > > >
> > > > It could make sense to declare that the default ordering for
> > > floats/doubles
> > > > is not NaN-aware (i.e. the reader should assume that NaN was
> > arbitrarily
> > > > ordered) and readers should either implement the required logic to
> > handle
> > > > that correctly (I had some ideas here:
> > > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > > or ignore the stats.
> > > >
> > > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com>
> > wrote:
> > > >
> > > > > > We could have a similar problem
> > > > > > with not finding +0.0 values because a -0.0 is written to the
> > > max_value
> > > > > > field by some component that considers them the same.
> > > > >
> > > > > My hope is that the filtering would behave sanely, since -0.0 ==
> +0.0
> > > > > under the real-number-inspired ordering, which is distinguished
> from
> > > > > total Ordering, and which is also what you get when you use the
> > > > > default C/C++ operators <, >, <=, ==, and so on.
> > > > >
> > > > > You can distinguish between -0.0 and +0.0 without using total
> > ordering
> > > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > > > ways to distinguish, I suspect, but that's the simplest one I
> recall
> > > > > at the moment.
> > > > >
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
just define a special order for floats? Essentially it would be a tag for
writers to say "hey I know about this total order thing".

On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:

> I think one idea behind the column order fields was that if a reader does
> not recognize a value there, it needs to ignore the stats. If I remember
> correctly, that was intended to allow us to add new orderings for
> collations, but it also seems useful to address gaps in the spec or known
> broken readers. In this case we would need to deprecate the default
> "TypeDefinedOrder" and replace it with something like
> "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up,
> like TypeDefinedOrderV2 and so on.
>
> An alternative would be to list all writers that are known to have written
> incorrect stats. However that will not prevent old implementations to
> misinterpret correct stats - which I think was the main reason why we added
> new stats fields.
>
>
>
> On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
> wrote:
>
> > I hope the common cases is that data files do not contain these special
> > float values. As the simplest solution, how about writers refrain from
> > populating the stats if a special value is encountered?
> >
> > That fix does not preclude a more thorough solution in the future, but it
> > addresses the common case quickly.
> >
> > For existing data files we could check the writer version ignore filters
> on
> > float/double. I don't know whether min/max filtering is common on
> > float/double, but I suspect it's not.
> >
> > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
> > wrote:
> >
> > > There is an extensibility mechanism with the ColumnOrder union - I
> think
> > > that was meant to avoid the need to add new stat fields?
> > >
> > > Given that the bug was in the Parquet spec, we'll need to make a spec
> > > change anyway, so we could add a new ColumnOrder -
> > FloatingPointTotalOrder?
> > > at the same time as fixing the gap in the spec.
> > >
> > > It could make sense to declare that the default ordering for
> > floats/doubles
> > > is not NaN-aware (i.e. the reader should assume that NaN was
> arbitrarily
> > > ordered) and readers should either implement the required logic to
> handle
> > > that correctly (I had some ideas here:
> > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > or ignore the stats.
> > >
> > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com>
> wrote:
> > >
> > > > > We could have a similar problem
> > > > > with not finding +0.0 values because a -0.0 is written to the
> > max_value
> > > > > field by some component that considers them the same.
> > > >
> > > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > > under the real-number-inspired ordering, which is distinguished from
> > > > total Ordering, and which is also what you get when you use the
> > > > default C/C++ operators <, >, <=, ==, and so on.
> > > >
> > > > You can distinguish between -0.0 and +0.0 without using total
> ordering
> > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > > at the moment.
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

We wouldn't need to rev the whole TypeDefinedOrder thing right? Couldn't we
just define a special order for floats? Essentially it would be a tag for
writers to say "hey I know about this total order thing".

On Fri, Feb 16, 2018 at 9:14 AM, Lars Volker <lv...@cloudera.com> wrote:

> I think one idea behind the column order fields was that if a reader does
> not recognize a value there, it needs to ignore the stats. If I remember
> correctly, that was intended to allow us to add new orderings for
> collations, but it also seems useful to address gaps in the spec or known
> broken readers. In this case we would need to deprecate the default
> "TypeDefinedOrder" and replace it with something like
> "TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up,
> like TypeDefinedOrderV2 and so on.
>
> An alternative would be to list all writers that are known to have written
> incorrect stats. However that will not prevent old implementations to
> misinterpret correct stats - which I think was the main reason why we added
> new stats fields.
>
>
>
> On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
> wrote:
>
> > I hope the common cases is that data files do not contain these special
> > float values. As the simplest solution, how about writers refrain from
> > populating the stats if a special value is encountered?
> >
> > That fix does not preclude a more thorough solution in the future, but it
> > addresses the common case quickly.
> >
> > For existing data files we could check the writer version ignore filters
> on
> > float/double. I don't know whether min/max filtering is common on
> > float/double, but I suspect it's not.
> >
> > On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
> > wrote:
> >
> > > There is an extensibility mechanism with the ColumnOrder union - I
> think
> > > that was meant to avoid the need to add new stat fields?
> > >
> > > Given that the bug was in the Parquet spec, we'll need to make a spec
> > > change anyway, so we could add a new ColumnOrder -
> > FloatingPointTotalOrder?
> > > at the same time as fixing the gap in the spec.
> > >
> > > It could make sense to declare that the default ordering for
> > floats/doubles
> > > is not NaN-aware (i.e. the reader should assume that NaN was
> arbitrarily
> > > ordered) and readers should either implement the required logic to
> handle
> > > that correctly (I had some ideas here:
> > > https://issues.apache.org/jira/browse/IMPALA-6527?
> > > focusedCommentId=16366106&page=com.atlassian.jira.
> > > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > > or ignore the stats.
> > >
> > > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com>
> wrote:
> > >
> > > > > We could have a similar problem
> > > > > with not finding +0.0 values because a -0.0 is written to the
> > max_value
> > > > > field by some component that considers them the same.
> > > >
> > > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > > under the real-number-inspired ordering, which is distinguished from
> > > > total Ordering, and which is also what you get when you use the
> > > > default C/C++ operators <, >, <=, ==, and so on.
> > > >
> > > > You can distinguish between -0.0 and +0.0 without using total
> ordering
> > > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > > at the moment.
> > > >
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Lars Volker <lv...@cloudera.com>.

I think one idea behind the column order fields was that if a reader does
not recognize a value there, it needs to ignore the stats. If I remember
correctly, that was intended to allow us to add new orderings for
collations, but it also seems useful to address gaps in the spec or known
broken readers. In this case we would need to deprecate the default
"TypeDefinedOrder" and replace it with something like
"TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up,
like TypeDefinedOrderV2 and so on.

An alternative would be to list all writers that are known to have written
incorrect stats. However that will not prevent old implementations to
misinterpret correct stats - which I think was the main reason why we added
new stats fields.



On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
wrote:

> I hope the common cases is that data files do not contain these special
> float values. As the simplest solution, how about writers refrain from
> populating the stats if a special value is encountered?
>
> That fix does not preclude a more thorough solution in the future, but it
> addresses the common case quickly.
>
> For existing data files we could check the writer version ignore filters on
> float/double. I don't know whether min/max filtering is common on
> float/double, but I suspect it's not.
>
> On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
> wrote:
>
> > There is an extensibility mechanism with the ColumnOrder union - I think
> > that was meant to avoid the need to add new stat fields?
> >
> > Given that the bug was in the Parquet spec, we'll need to make a spec
> > change anyway, so we could add a new ColumnOrder -
> FloatingPointTotalOrder?
> > at the same time as fixing the gap in the spec.
> >
> > It could make sense to declare that the default ordering for
> floats/doubles
> > is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> > ordered) and readers should either implement the required logic to handle
> > that correctly (I had some ideas here:
> > https://issues.apache.org/jira/browse/IMPALA-6527?
> > focusedCommentId=16366106&page=com.atlassian.jira.
> > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > or ignore the stats.
> >
> > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com> wrote:
> >
> > > > We could have a similar problem
> > > > with not finding +0.0 values because a -0.0 is written to the
> max_value
> > > > field by some component that considers them the same.
> > >
> > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > under the real-number-inspired ordering, which is distinguished from
> > > total Ordering, and which is also what you get when you use the
> > > default C/C++ operators <, >, <=, ==, and so on.
> > >
> > > You can distinguish between -0.0 and +0.0 without using total ordering
> > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > at the moment.
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

I don't see a major benefit to a temporary solution. The files are already
out there and we need to implement a fix on the read path regardless. If we
keep writing the stats there's at least some information contained in the
stats that readers can make use of, if they want to implement the required
logic.

Dropping stats if an NaN is encountered also doesn't really address the
other side of the problem - an absence of a NaN in the stats doesn't imply
an absence of a NaN in the data, so the reader can't do anything useful
with the stats anyway unless it's NaN-aware.

On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
wrote:

> I hope the common cases is that data files do not contain these special
> float values. As the simplest solution, how about writers refrain from
> populating the stats if a special value is encountered?
>
> That fix does not preclude a more thorough solution in the future, but it
> addresses the common case quickly.
>
> For existing data files we could check the writer version ignore filters on
> float/double. I don't know whether min/max filtering is common on
> float/double, but I suspect it's not.
>
> On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
> wrote:
>
> > There is an extensibility mechanism with the ColumnOrder union - I think
> > that was meant to avoid the need to add new stat fields?
> >
> > Given that the bug was in the Parquet spec, we'll need to make a spec
> > change anyway, so we could add a new ColumnOrder -
> FloatingPointTotalOrder?
> > at the same time as fixing the gap in the spec.
> >
> > It could make sense to declare that the default ordering for
> floats/doubles
> > is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> > ordered) and readers should either implement the required logic to handle
> > that correctly (I had some ideas here:
> > https://issues.apache.org/jira/browse/IMPALA-6527?
> > focusedCommentId=16366106&page=com.atlassian.jira.
> > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > or ignore the stats.
> >
> > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com> wrote:
> >
> > > > We could have a similar problem
> > > > with not finding +0.0 values because a -0.0 is written to the
> max_value
> > > > field by some component that considers them the same.
> > >
> > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > under the real-number-inspired ordering, which is distinguished from
> > > total Ordering, and which is also what you get when you use the
> > > default C/C++ operators <, >, <=, ==, and so on.
> > >
> > > You can distinguish between -0.0 and +0.0 without using total ordering
> > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > at the moment.
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Lars Volker <lv...@cloudera.com>.

I think one idea behind the column order fields was that if a reader does
not recognize a value there, it needs to ignore the stats. If I remember
correctly, that was intended to allow us to add new orderings for
collations, but it also seems useful to address gaps in the spec or known
broken readers. In this case we would need to deprecate the default
"TypeDefinedOrder" and replace it with something like
"TypeDefinedOrderWithCorrectOrderingForDoubles". We could also count up,
like TypeDefinedOrderV2 and so on.

An alternative would be to list all writers that are known to have written
incorrect stats. However that will not prevent old implementations to
misinterpret correct stats - which I think was the main reason why we added
new stats fields.



On Fri, Feb 16, 2018 at 9:03 AM, Alexander Behm <al...@cloudera.com>
wrote:

> I hope the common cases is that data files do not contain these special
> float values. As the simplest solution, how about writers refrain from
> populating the stats if a special value is encountered?
>
> That fix does not preclude a more thorough solution in the future, but it
> addresses the common case quickly.
>
> For existing data files we could check the writer version ignore filters on
> float/double. I don't know whether min/max filtering is common on
> float/double, but I suspect it's not.
>
> On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
> wrote:
>
> > There is an extensibility mechanism with the ColumnOrder union - I think
> > that was meant to avoid the need to add new stat fields?
> >
> > Given that the bug was in the Parquet spec, we'll need to make a spec
> > change anyway, so we could add a new ColumnOrder -
> FloatingPointTotalOrder?
> > at the same time as fixing the gap in the spec.
> >
> > It could make sense to declare that the default ordering for
> floats/doubles
> > is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> > ordered) and readers should either implement the required logic to handle
> > that correctly (I had some ideas here:
> > https://issues.apache.org/jira/browse/IMPALA-6527?
> > focusedCommentId=16366106&page=com.atlassian.jira.
> > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> > or ignore the stats.
> >
> > On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com> wrote:
> >
> > > > We could have a similar problem
> > > > with not finding +0.0 values because a -0.0 is written to the
> max_value
> > > > field by some component that considers them the same.
> > >
> > > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > > under the real-number-inspired ordering, which is distinguished from
> > > total Ordering, and which is also what you get when you use the
> > > default C/C++ operators <, >, <=, ==, and so on.
> > >
> > > You can distinguish between -0.0 and +0.0 without using total ordering
> > > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > > ways to distinguish, I suspect, but that's the simplest one I recall
> > > at the moment.
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

I hope the common cases is that data files do not contain these special
float values. As the simplest solution, how about writers refrain from
populating the stats if a special value is encountered?

That fix does not preclude a more thorough solution in the future, but it
addresses the common case quickly.

For existing data files we could check the writer version ignore filters on
float/double. I don't know whether min/max filtering is common on
float/double, but I suspect it's not.

On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> There is an extensibility mechanism with the ColumnOrder union - I think
> that was meant to avoid the need to add new stat fields?
>
> Given that the bug was in the Parquet spec, we'll need to make a spec
> change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
> at the same time as fixing the gap in the spec.
>
> It could make sense to declare that the default ordering for floats/doubles
> is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> ordered) and readers should either implement the required logic to handle
> that correctly (I had some ideas here:
> https://issues.apache.org/jira/browse/IMPALA-6527?
> focusedCommentId=16366106&page=com.atlassian.jira.
> plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> or ignore the stats.
>
> On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com> wrote:
>
> > > We could have a similar problem
> > > with not finding +0.0 values because a -0.0 is written to the max_value
> > > field by some component that considers them the same.
> >
> > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > under the real-number-inspired ordering, which is distinguished from
> > total Ordering, and which is also what you get when you use the
> > default C/C++ operators <, >, <=, ==, and so on.
> >
> > You can distinguish between -0.0 and +0.0 without using total ordering
> > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > ways to distinguish, I suspect, but that's the simplest one I recall
> > at the moment.
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Alexander Behm <al...@cloudera.com>.

I hope the common cases is that data files do not contain these special
float values. As the simplest solution, how about writers refrain from
populating the stats if a special value is encountered?

That fix does not preclude a more thorough solution in the future, but it
addresses the common case quickly.

For existing data files we could check the writer version ignore filters on
float/double. I don't know whether min/max filtering is common on
float/double, but I suspect it's not.

On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <ta...@cloudera.com>
wrote:

> There is an extensibility mechanism with the ColumnOrder union - I think
> that was meant to avoid the need to add new stat fields?
>
> Given that the bug was in the Parquet spec, we'll need to make a spec
> change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
> at the same time as fixing the gap in the spec.
>
> It could make sense to declare that the default ordering for floats/doubles
> is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> ordered) and readers should either implement the required logic to handle
> that correctly (I had some ideas here:
> https://issues.apache.org/jira/browse/IMPALA-6527?
> focusedCommentId=16366106&page=com.atlassian.jira.
> plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> or ignore the stats.
>
> On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com> wrote:
>
> > > We could have a similar problem
> > > with not finding +0.0 values because a -0.0 is written to the max_value
> > > field by some component that considers them the same.
> >
> > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > under the real-number-inspired ordering, which is distinguished from
> > total Ordering, and which is also what you get when you use the
> > default C/C++ operators <, >, <=, ==, and so on.
> >
> > You can distinguish between -0.0 and +0.0 without using total ordering
> > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > ways to distinguish, I suspect, but that's the simplest one I recall
> > at the moment.
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

There is an extensibility mechanism with the ColumnOrder union - I think
that was meant to avoid the need to add new stat fields?

Given that the bug was in the Parquet spec, we'll need to make a spec
change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
at the same time as fixing the gap in the spec.

It could make sense to declare that the default ordering for floats/doubles
is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
ordered) and readers should either implement the required logic to handle
that correctly (I had some ideas here:
https://issues.apache.org/jira/browse/IMPALA-6527?focusedCommentId=16366106&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
or ignore the stats.

On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com> wrote:

> > We could have a similar problem
> > with not finding +0.0 values because a -0.0 is written to the max_value
> > field by some component that considers them the same.
>
> My hope is that the filtering would behave sanely, since -0.0 == +0.0
> under the real-number-inspired ordering, which is distinguished from
> total Ordering, and which is also what you get when you use the
> default C/C++ operators <, >, <=, ==, and so on.
>
> You can distinguish between -0.0 and +0.0 without using total ordering
> by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> ways to distinguish, I suspect, but that's the simplest one I recall
> at the moment.
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

There is an extensibility mechanism with the ColumnOrder union - I think
that was meant to avoid the need to add new stat fields?

Given that the bug was in the Parquet spec, we'll need to make a spec
change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
at the same time as fixing the gap in the spec.

It could make sense to declare that the default ordering for floats/doubles
is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
ordered) and readers should either implement the required logic to handle
that correctly (I had some ideas here:
https://issues.apache.org/jira/browse/IMPALA-6527?focusedCommentId=16366106&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
or ignore the stats.

On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jb...@cloudera.com> wrote:

> > We could have a similar problem
> > with not finding +0.0 values because a -0.0 is written to the max_value
> > field by some component that considers them the same.
>
> My hope is that the filtering would behave sanely, since -0.0 == +0.0
> under the real-number-inspired ordering, which is distinguished from
> total Ordering, and which is also what you get when you use the
> default C/C++ operators <, >, <=, ==, and so on.
>
> You can distinguish between -0.0 and +0.0 without using total ordering
> by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> ways to distinguish, I suspect, but that's the simplest one I recall
> at the moment.
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Jim Apple <jb...@cloudera.com>.

> We could have a similar problem
> with not finding +0.0 values because a -0.0 is written to the max_value
> field by some component that considers them the same.

My hope is that the filtering would behave sanely, since -0.0 == +0.0
under the real-number-inspired ordering, which is distinguished from
total Ordering, and which is also what you get when you use the
default C/C++ operators <, >, <=, ==, and so on.

You can distinguish between -0.0 and +0.0 without using total ordering
by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
ways to distinguish, I suspect, but that's the simplest one I recall
at the moment.

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Jim Apple <jb...@cloudera.com>.

> We could have a similar problem
> with not finding +0.0 values because a -0.0 is written to the max_value
> field by some component that considers them the same.

My hope is that the filtering would behave sanely, since -0.0 == +0.0
under the real-number-inspired ordering, which is distinguished from
total Ordering, and which is also what you get when you use the
default C/C++ operators <, >, <=, ==, and so on.

You can distinguish between -0.0 and +0.0 without using total ordering
by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
ways to distinguish, I suspect, but that's the simplest one I recall
at the moment.

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Zoltan Ivanfi <zi...@cloudera.com>.

Hi,

I don't think it would be worth to keep a separate NaN count, but we could
ignore them when calculating min/max stats regardless. However, NaN is not
the only value preventing total ordering. We could have a similar problem
with not finding +0.0 values because a -0.0 is written to the max_value
field by some component that considers them the same.

One more thing to consider is how to deal with existing data when we update
the specs. One possibility is to specify legacy rules, like "if the stats
contain NaN and the file was written by Impala, it should be ignored", but
that would complicate the specs and be a burden on implementors. In fact,
min_value and max_value were introduced because we did not want to define
similar legacy rules for min and max. So should we deprecate min_value and
max_value as well and introduce yet_another_min and yet_another_max fields
instead (with nicer names, naturally)?

Br,

Zoltan

On Thu, Feb 15, 2018 at 8:01 PM Tim Armstrong <ta...@cloudera.com>
wrote:

> We could also consider treating NaN similar to NULL and having a separate
> piece of information with a count of NaN values (or just a bit indicating
> presence/absence of NaN). I'm not sure if that is easier or harder to
> implement than a total order.
>
> On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal <la...@cloudera.com>
> wrote:
>
> > To supply some context: Impala has had a number of issues
> > <https://issues.apache.org/jira/issues/?jql=project%
> > 3Dimpala%20and%20summary%20%20~%20NaN>
> > around NaN/infinity:
> >
> > The closest precedent related to the current issue seems to be
> IMPALA-6295
> > <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent
> > handling
> > of 'nan' and 'inf' with min/max analytic fns": the discussion there
> offers
> > notable points on:
> > 1. How Impala handles similar problems in different (but related) areas,
> > 2. How other database products (Hive, PostgeSQL, etc.) handle similar
> > issues around NaNs/infinity (or infinities, in the case of IEEE-754).
> >
> > Thanks,
> >
> >     - LaszloG
> >
> >
> > On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >
> > > Dear Parquet and Impala Developers,
> > >
> > > We have exposed min/max statistics to extensive compatibility testing
> and
> > > found troubling inconsistencies regarding float and double values.
> Under
> > > certain (fortunately rather extreme) circumstances, this can lead to
> > > predicate pushdown incorrectly discarding row groups that contain
> > matching
> > > rows.
> > >
> > > The root of the problem seems to be that Impala (and probably
> parquet-cpp
> > > as well) uses C++ comparison operators for floating point numbers and
> > those
> > > do not provide a total ordering. This is actually in line with IEEE
> 754,
> > > according to which -0 is neither less nor more than +0 and comparing
> NaN
> > to
> > > anything always returns false. This, however is not suitable for
> > statistics
> > > and can lead to serious consequences that you can read more about in
> > > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
> > >
> > > The IEEE 754 standard and the Java API, on the other hand, both
> provide a
> > > total ordering, but I'm not sure whether the two are the same. The java
> > > implementation looks relatively simple - both easy to understand and
> > > effective to execute. You can check it here
> > > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> > > classes/java/lang/Double.java#l999>.
> > > The IEEE 754 total ordering, on the other hand looks rather complicated
> > to
> > > the extent that I can not decide whether the Java implementation
> adheres
> > to
> > > it. I couldn't find the whole standard online, but I found an excerpt
> > about
> > > the totalOrder predicate here
> > > <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have
> > also
> > > found that IEEE 754-2008 defines min and max operations as described
> here
> > > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> > > strangely *do not* adhere to a total ordering.
> > >
> > > I checked the specification in parquet-format but all I could find
> about
> > > floating point numbers is the following:
> > >
> > >    *   FLOAT - signed comparison of the represented value
> > >    *   DOUBLE - signed comparison of the represented value
> > >
> > > I suggest extending the specification to explicitly require
> > implementations
> > > to follow a specific comparison logic for these types. The candidates
> > are:
> > >
> > >    - The Java implementation
> > >    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/
> > 687fd7c7986d/src/share/
> > > classes/java/lang/Double.java#l999>
> > >    which looks easy and efficient to implement in any language.
> > >    - The IEEE 754 totalOrder <https://github.com/rust-lang/
> > > rust/issues/5585>
> > >    predicate which honestly looks rather scary.
> > >    - The IEEE 754-2008 min and max
> > >    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> > > operations
> > >    which may be hard to use for comparison.
> > >
> > > I'm curious to hear your opinions.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Zoltan Ivanfi <zi...@cloudera.com>.

Hi,

I don't think it would be worth to keep a separate NaN count, but we could
ignore them when calculating min/max stats regardless. However, NaN is not
the only value preventing total ordering. We could have a similar problem
with not finding +0.0 values because a -0.0 is written to the max_value
field by some component that considers them the same.

One more thing to consider is how to deal with existing data when we update
the specs. One possibility is to specify legacy rules, like "if the stats
contain NaN and the file was written by Impala, it should be ignored", but
that would complicate the specs and be a burden on implementors. In fact,
min_value and max_value were introduced because we did not want to define
similar legacy rules for min and max. So should we deprecate min_value and
max_value as well and introduce yet_another_min and yet_another_max fields
instead (with nicer names, naturally)?

Br,

Zoltan

On Thu, Feb 15, 2018 at 8:01 PM Tim Armstrong <ta...@cloudera.com>
wrote:

> We could also consider treating NaN similar to NULL and having a separate
> piece of information with a count of NaN values (or just a bit indicating
> presence/absence of NaN). I'm not sure if that is easier or harder to
> implement than a total order.
>
> On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal <la...@cloudera.com>
> wrote:
>
> > To supply some context: Impala has had a number of issues
> > <https://issues.apache.org/jira/issues/?jql=project%
> > 3Dimpala%20and%20summary%20%20~%20NaN>
> > around NaN/infinity:
> >
> > The closest precedent related to the current issue seems to be
> IMPALA-6295
> > <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent
> > handling
> > of 'nan' and 'inf' with min/max analytic fns": the discussion there
> offers
> > notable points on:
> > 1. How Impala handles similar problems in different (but related) areas,
> > 2. How other database products (Hive, PostgeSQL, etc.) handle similar
> > issues around NaNs/infinity (or infinities, in the case of IEEE-754).
> >
> > Thanks,
> >
> >     - LaszloG
> >
> >
> > On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >
> > > Dear Parquet and Impala Developers,
> > >
> > > We have exposed min/max statistics to extensive compatibility testing
> and
> > > found troubling inconsistencies regarding float and double values.
> Under
> > > certain (fortunately rather extreme) circumstances, this can lead to
> > > predicate pushdown incorrectly discarding row groups that contain
> > matching
> > > rows.
> > >
> > > The root of the problem seems to be that Impala (and probably
> parquet-cpp
> > > as well) uses C++ comparison operators for floating point numbers and
> > those
> > > do not provide a total ordering. This is actually in line with IEEE
> 754,
> > > according to which -0 is neither less nor more than +0 and comparing
> NaN
> > to
> > > anything always returns false. This, however is not suitable for
> > statistics
> > > and can lead to serious consequences that you can read more about in
> > > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
> > >
> > > The IEEE 754 standard and the Java API, on the other hand, both
> provide a
> > > total ordering, but I'm not sure whether the two are the same. The java
> > > implementation looks relatively simple - both easy to understand and
> > > effective to execute. You can check it here
> > > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> > > classes/java/lang/Double.java#l999>.
> > > The IEEE 754 total ordering, on the other hand looks rather complicated
> > to
> > > the extent that I can not decide whether the Java implementation
> adheres
> > to
> > > it. I couldn't find the whole standard online, but I found an excerpt
> > about
> > > the totalOrder predicate here
> > > <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have
> > also
> > > found that IEEE 754-2008 defines min and max operations as described
> here
> > > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> > > strangely *do not* adhere to a total ordering.
> > >
> > > I checked the specification in parquet-format but all I could find
> about
> > > floating point numbers is the following:
> > >
> > >    *   FLOAT - signed comparison of the represented value
> > >    *   DOUBLE - signed comparison of the represented value
> > >
> > > I suggest extending the specification to explicitly require
> > implementations
> > > to follow a specific comparison logic for these types. The candidates
> > are:
> > >
> > >    - The Java implementation
> > >    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/
> > 687fd7c7986d/src/share/
> > > classes/java/lang/Double.java#l999>
> > >    which looks easy and efficient to implement in any language.
> > >    - The IEEE 754 totalOrder <https://github.com/rust-lang/
> > > rust/issues/5585>
> > >    predicate which honestly looks rather scary.
> > >    - The IEEE 754-2008 min and max
> > >    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> > > operations
> > >    which may be hard to use for comparison.
> > >
> > > I'm curious to hear your opinions.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

We could also consider treating NaN similar to NULL and having a separate
piece of information with a count of NaN values (or just a bit indicating
presence/absence of NaN). I'm not sure if that is easier or harder to
implement than a total order.

On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal <la...@cloudera.com>
wrote:

> To supply some context: Impala has had a number of issues
> <https://issues.apache.org/jira/issues/?jql=project%
> 3Dimpala%20and%20summary%20%20~%20NaN>
> around NaN/infinity:
>
> The closest precedent related to the current issue seems to be IMPALA-6295
> <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent
> handling
> of 'nan' and 'inf' with min/max analytic fns": the discussion there offers
> notable points on:
> 1. How Impala handles similar problems in different (but related) areas,
> 2. How other database products (Hive, PostgeSQL, etc.) handle similar
> issues around NaNs/infinity (or infinities, in the case of IEEE-754).
>
> Thanks,
>
>     - LaszloG
>
>
> On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>
> > Dear Parquet and Impala Developers,
> >
> > We have exposed min/max statistics to extensive compatibility testing and
> > found troubling inconsistencies regarding float and double values. Under
> > certain (fortunately rather extreme) circumstances, this can lead to
> > predicate pushdown incorrectly discarding row groups that contain
> matching
> > rows.
> >
> > The root of the problem seems to be that Impala (and probably parquet-cpp
> > as well) uses C++ comparison operators for floating point numbers and
> those
> > do not provide a total ordering. This is actually in line with IEEE 754,
> > according to which -0 is neither less nor more than +0 and comparing NaN
> to
> > anything always returns false. This, however is not suitable for
> statistics
> > and can lead to serious consequences that you can read more about in
> > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
> >
> > The IEEE 754 standard and the Java API, on the other hand, both provide a
> > total ordering, but I'm not sure whether the two are the same. The java
> > implementation looks relatively simple - both easy to understand and
> > effective to execute. You can check it here
> > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> > classes/java/lang/Double.java#l999>.
> > The IEEE 754 total ordering, on the other hand looks rather complicated
> to
> > the extent that I can not decide whether the Java implementation adheres
> to
> > it. I couldn't find the whole standard online, but I found an excerpt
> about
> > the totalOrder predicate here
> > <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have
> also
> > found that IEEE 754-2008 defines min and max operations as described here
> > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> > strangely *do not* adhere to a total ordering.
> >
> > I checked the specification in parquet-format but all I could find about
> > floating point numbers is the following:
> >
> >    *   FLOAT - signed comparison of the represented value
> >    *   DOUBLE - signed comparison of the represented value
> >
> > I suggest extending the specification to explicitly require
> implementations
> > to follow a specific comparison logic for these types. The candidates
> are:
> >
> >    - The Java implementation
> >    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/
> 687fd7c7986d/src/share/
> > classes/java/lang/Double.java#l999>
> >    which looks easy and efficient to implement in any language.
> >    - The IEEE 754 totalOrder <https://github.com/rust-lang/
> > rust/issues/5585>
> >    predicate which honestly looks rather scary.
> >    - The IEEE 754-2008 min and max
> >    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> > operations
> >    which may be hard to use for comparison.
> >
> > I'm curious to hear your opinions.
> >
> > Thanks,
> >
> > Zoltan
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Tim Armstrong <ta...@cloudera.com>.

We could also consider treating NaN similar to NULL and having a separate
piece of information with a count of NaN values (or just a bit indicating
presence/absence of NaN). I'm not sure if that is easier or harder to
implement than a total order.

On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal <la...@cloudera.com>
wrote:

> To supply some context: Impala has had a number of issues
> <https://issues.apache.org/jira/issues/?jql=project%
> 3Dimpala%20and%20summary%20%20~%20NaN>
> around NaN/infinity:
>
> The closest precedent related to the current issue seems to be IMPALA-6295
> <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent
> handling
> of 'nan' and 'inf' with min/max analytic fns": the discussion there offers
> notable points on:
> 1. How Impala handles similar problems in different (but related) areas,
> 2. How other database products (Hive, PostgeSQL, etc.) handle similar
> issues around NaNs/infinity (or infinities, in the case of IEEE-754).
>
> Thanks,
>
>     - LaszloG
>
>
> On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>
> > Dear Parquet and Impala Developers,
> >
> > We have exposed min/max statistics to extensive compatibility testing and
> > found troubling inconsistencies regarding float and double values. Under
> > certain (fortunately rather extreme) circumstances, this can lead to
> > predicate pushdown incorrectly discarding row groups that contain
> matching
> > rows.
> >
> > The root of the problem seems to be that Impala (and probably parquet-cpp
> > as well) uses C++ comparison operators for floating point numbers and
> those
> > do not provide a total ordering. This is actually in line with IEEE 754,
> > according to which -0 is neither less nor more than +0 and comparing NaN
> to
> > anything always returns false. This, however is not suitable for
> statistics
> > and can lead to serious consequences that you can read more about in
> > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
> >
> > The IEEE 754 standard and the Java API, on the other hand, both provide a
> > total ordering, but I'm not sure whether the two are the same. The java
> > implementation looks relatively simple - both easy to understand and
> > effective to execute. You can check it here
> > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> > classes/java/lang/Double.java#l999>.
> > The IEEE 754 total ordering, on the other hand looks rather complicated
> to
> > the extent that I can not decide whether the Java implementation adheres
> to
> > it. I couldn't find the whole standard online, but I found an excerpt
> about
> > the totalOrder predicate here
> > <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have
> also
> > found that IEEE 754-2008 defines min and max operations as described here
> > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> > strangely *do not* adhere to a total ordering.
> >
> > I checked the specification in parquet-format but all I could find about
> > floating point numbers is the following:
> >
> >    *   FLOAT - signed comparison of the represented value
> >    *   DOUBLE - signed comparison of the represented value
> >
> > I suggest extending the specification to explicitly require
> implementations
> > to follow a specific comparison logic for these types. The candidates
> are:
> >
> >    - The Java implementation
> >    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/
> 687fd7c7986d/src/share/
> > classes/java/lang/Double.java#l999>
> >    which looks easy and efficient to implement in any language.
> >    - The IEEE 754 totalOrder <https://github.com/rust-lang/
> > rust/issues/5585>
> >    predicate which honestly looks rather scary.
> >    - The IEEE 754-2008 min and max
> >    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> > operations
> >    which may be hard to use for comparison.
> >
> > I'm curious to hear your opinions.
> >
> > Thanks,
> >
> > Zoltan
> >
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Laszlo Gaal <la...@cloudera.com>.

To supply some context: Impala has had a number of issues
<https://issues.apache.org/jira/issues/?jql=project%3Dimpala%20and%20summary%20%20~%20NaN>
around NaN/infinity:

The closest precedent related to the current issue seems to be IMPALA-6295
<https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent handling
of 'nan' and 'inf' with min/max analytic fns": the discussion there offers
notable points on:
1. How Impala handles similar problems in different (but related) areas,
2. How other database products (Hive, PostgeSQL, etc.) handle similar
issues around NaNs/infinity (or infinities, in the case of IEEE-754).

Thanks,

    - LaszloG


On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Dear Parquet and Impala Developers,
>
> We have exposed min/max statistics to extensive compatibility testing and
> found troubling inconsistencies regarding float and double values. Under
> certain (fortunately rather extreme) circumstances, this can lead to
> predicate pushdown incorrectly discarding row groups that contain matching
> rows.
>
> The root of the problem seems to be that Impala (and probably parquet-cpp
> as well) uses C++ comparison operators for floating point numbers and those
> do not provide a total ordering. This is actually in line with IEEE 754,
> according to which -0 is neither less nor more than +0 and comparing NaN to
> anything always returns false. This, however is not suitable for statistics
> and can lead to serious consequences that you can read more about in
> IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
>
> The IEEE 754 standard and the Java API, on the other hand, both provide a
> total ordering, but I'm not sure whether the two are the same. The java
> implementation looks relatively simple - both easy to understand and
> effective to execute. You can check it here
> <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> classes/java/lang/Double.java#l999>.
> The IEEE 754 total ordering, on the other hand looks rather complicated to
> the extent that I can not decide whether the Java implementation adheres to
> it. I couldn't find the whole standard online, but I found an excerpt about
> the totalOrder predicate here
> <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have also
> found that IEEE 754-2008 defines min and max operations as described here
> <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> strangely *do not* adhere to a total ordering.
>
> I checked the specification in parquet-format but all I could find about
> floating point numbers is the following:
>
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
>
> I suggest extending the specification to explicitly require implementations
> to follow a specific comparison logic for these types. The candidates are:
>
>    - The Java implementation
>    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> classes/java/lang/Double.java#l999>
>    which looks easy and efficient to implement in any language.
>    - The IEEE 754 totalOrder <https://github.com/rust-lang/
> rust/issues/5585>
>    predicate which honestly looks rather scary.
>    - The IEEE 754-2008 min and max
>    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> operations
>    which may be hard to use for comparison.
>
> I'm curious to hear your opinions.
>
> Thanks,
>
> Zoltan
>

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Posted by Laszlo Gaal <la...@cloudera.com>.

To supply some context: Impala has had a number of issues
<https://issues.apache.org/jira/issues/?jql=project%3Dimpala%20and%20summary%20%20~%20NaN>
around NaN/infinity:

The closest precedent related to the current issue seems to be IMPALA-6295
<https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent handling
of 'nan' and 'inf' with min/max analytic fns": the discussion there offers
notable points on:
1. How Impala handles similar problems in different (but related) areas,
2. How other database products (Hive, PostgeSQL, etc.) handle similar
issues around NaNs/infinity (or infinities, in the case of IEEE-754).

Thanks,

    - LaszloG


On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Dear Parquet and Impala Developers,
>
> We have exposed min/max statistics to extensive compatibility testing and
> found troubling inconsistencies regarding float and double values. Under
> certain (fortunately rather extreme) circumstances, this can lead to
> predicate pushdown incorrectly discarding row groups that contain matching
> rows.
>
> The root of the problem seems to be that Impala (and probably parquet-cpp
> as well) uses C++ comparison operators for floating point numbers and those
> do not provide a total ordering. This is actually in line with IEEE 754,
> according to which -0 is neither less nor more than +0 and comparing NaN to
> anything always returns false. This, however is not suitable for statistics
> and can lead to serious consequences that you can read more about in
> IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
>
> The IEEE 754 standard and the Java API, on the other hand, both provide a
> total ordering, but I'm not sure whether the two are the same. The java
> implementation looks relatively simple - both easy to understand and
> effective to execute. You can check it here
> <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> classes/java/lang/Double.java#l999>.
> The IEEE 754 total ordering, on the other hand looks rather complicated to
> the extent that I can not decide whether the Java implementation adheres to
> it. I couldn't find the whole standard online, but I found an excerpt about
> the totalOrder predicate here
> <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have also
> found that IEEE 754-2008 defines min and max operations as described here
> <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> strangely *do not* adhere to a total ordering.
>
> I checked the specification in parquet-format but all I could find about
> floating point numbers is the following:
>
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
>
> I suggest extending the specification to explicitly require implementations
> to follow a specific comparison logic for these types. The candidates are:
>
>    - The Java implementation
>    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> classes/java/lang/Double.java#l999>
>    which looks easy and efficient to implement in any language.
>    - The IEEE 754 totalOrder <https://github.com/rust-lang/
> rust/issues/5585>
>    predicate which honestly looks rather scary.
>    - The IEEE 754-2008 min and max
>    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> operations
>    which may be hard to use for comparison.
>
> I'm curious to hear your opinions.
>
> Thanks,
>
> Zoltan
>