Posted to dev@arrow.apache.org by Phillip Cloud <cp...@gmail.com> on 2018/02/14 01:50:16 UTC

Decimal NaNs

Recently someone opened ARROW-2145
<https://issues.apache.org/jira/projects/ARROW/issues/ARROW-2145> asking
for support for non-finite values, such as NaN and infinity.
It may seem like a “no-brainer” to implement this, but there’s no real
consistency on how to implement it or *even whether to implement it at all*:

   - Java BigDecimal: raises an exception for nan or inf as per the docs
   <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#BigDecimal-double->
   - boost multiprecision supports it, but not for fixed precision decimal
   numbers (only for cpp_bin_float/cpp_dec_float, which are arbitrary
   precision floating point, not fixed point)
   - python supports it using flags and special string exponents (and it
   supports both signaling and quiet nans)
   - impala doesn’t support it (returns null when you try to perform
   CAST(CAST('NaN' AS DOUBLE) AS DECIMAL))
   - postgres supports it with its numeric
   <https://www.postgresql.org/docs/10/static/datatype-numeric.html> type
   by using the sign member of the C struct backing numeric values
   <https://github.com/postgres/postgres/blob/c7b8998ebbf310a156aa38022555a24d98fdbfb4/src/interfaces/ecpg/include/pgtypes_numeric.h#L16-L25>
   - MySQL: doesn’t even support nan/inf!
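
To make the Python entry in the list above concrete, here is a minimal sketch using only the standard library decimal module. It shows that the special values carry a string exponent ('n', 'N' or 'F') in place of an integer one. The names specials and is_non_finite are made up for illustration and are not part of any Arrow API.

```python
from decimal import Decimal

# Special values are built from strings; Python marks them with a string
# exponent instead of an integer one.
specials = {
    name: Decimal(name) for name in ("NaN", "sNaN", "Infinity", "-Infinity")
}

for name, value in specials.items():
    sign, digits, exponent = value.as_tuple()
    # exponent is 'n' (quiet NaN), 'N' (signaling NaN) or 'F' (infinity)
    print(name, sign, digits, exponent)


def is_non_finite(value: Decimal) -> bool:
    """Hypothetical helper: True for NaN, sNaN and +/-Infinity."""
    return not value.is_finite()
```

A converter could use a check like is_non_finite to detect the values that a fixed precision Decimal128 has no way to represent.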

The lack of support for these values across languages likely stems from the
fact that fixed precision arithmetic by definition must happen on finite
values, and nan/inf are not finite values, so they are not supported.

We could go down this rabbit hole in the name of providing support for
Python decimal.Decimal(<non-finite value>) but I’m not sure how useful it
is.

No other system except in-memory C++ arrow arrays would be able to operate
on these values (I suppose we could add a wrapper around BigDecimal that
has the desired behavior).

For example, writing arrow arrays containing Decimal128 values (with nans
or infs) to a parquet file seems untenable.

Additionally, if we decided to implement it, we’d likely have to take
something like the flag approach, which would require a change to the
metadata (not necessarily a bad thing) that would add two bitmaps to arrow
Decimal arrays: one for indicating nan-ness and one for indicating inf-ness
(that’s a ton of overhead IMO when I think it’s likely that most values are
always finite).
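
As a rough sketch of what the flag approach would involve, the Python below models the two extra bitmaps at one bit per value, using the same least-significant-bit-first ordering as Arrow validity bitmaps. Everything here is hypothetical: is_nan, is_inf and the helper functions are illustrative names, no such buffers exist in the Arrow format, and buffer padding is ignored.

```python
import math

DECIMAL128_BYTES = 16  # one Decimal128 value occupies 128 bits


def bitmap_bytes(length: int) -> int:
    """Bytes needed for a one-bit-per-value bitmap (padding ignored)."""
    return math.ceil(length / 8)


def set_bit(bitmap: bytearray, i: int) -> None:
    """Set bit i, least-significant bit first within each byte."""
    bitmap[i // 8] |= 1 << (i % 8)


def get_bit(bitmap: bytearray, i: int) -> bool:
    """Read bit i using the same ordering as set_bit."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


length = 1_000_000
data_bytes = length * DECIMAL128_BYTES
# Hypothetical extra buffers: one NaN bitmap plus one infinity bitmap.
extra_bytes = 2 * bitmap_bytes(length)

is_nan = bytearray(bitmap_bytes(length))
is_inf = bytearray(bitmap_bytes(length))
set_bit(is_nan, 3)  # mark the value in slot 3 as NaN

print(data_bytes, extra_bytes)
```

Both bitmaps would have to be allocated and consulted for every array, even when every value is finite.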

I’m skeptical about whether we should support this.

Thoughts?

Re: Decimal NaNs

Posted by Wes McKinney <we...@gmail.com>.

hey Phillip,

Replying so we have a record on the mailing list about this. For the
user's use case, it seems that having reasonable null support for
decimals incoming from Python would be sufficient.
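
One possible reading of "reasonable null support" (an assumption, since the thread does not spell out the behavior) is that non-finite Python decimals are turned into nulls during conversion. A minimal standard-library sketch of that idea follows; finite_or_null is a hypothetical name, not an Arrow function.

```python
from decimal import Decimal
from typing import Iterable, List, Optional


def finite_or_null(
    values: Iterable[Optional[Decimal]],
) -> List[Optional[Decimal]]:
    """Keep finite decimals; map None, NaN, sNaN and +/-Infinity to None."""
    return [
        v if v is not None and v.is_finite() else None
        for v in values
    ]


incoming = [Decimal("1.25"), Decimal("NaN"), None, Decimal("-Infinity")]
cleaned = finite_or_null(incoming)
print(cleaned)
```

The None entries would then be recorded in the array's existing validity bitmap, so no format change is needed.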

I agree it's probably not worth supporting NaN decimals in the Arrow
format for now given the sporadic support across the ecosystem.

- Wes
