Posted to dev@arrow.apache.org by Rhys Ulerich <Rh...@twosigma.com> on 2018/12/10 15:47:21 UTC

valid NaNs versus invalid NaNs?

'Morning,



Regarding https://arrow.apache.org/docs/memory_layout.html, how should is_valid be interpreted for primitive types that have their own notions of is_valid?



Concretely, how should folks interpret a "valid NaN" (is_valid 1 with float NaN) versus an "invalid NaN" (is_valid 0 with float NaN)?  In RFC-ese, MUST individual NaNs be valid?  Or, MUST floats all be valid by omitting the validity bitset?



I ask because otherwise I can see a bunch of different systems interpreting this detail in many different ways.  That'd be an interop nightmare.  Especially since understanding why NaNs sneak into large datasets is already quite a hassle.



Anyhow, it seems worth addressing this gap at the written specification level.



(Apologies if this has been discussed previously-- I've found no searchable mailing list archives under http://mail-archives.apache.org/mod_mbox/arrow-dev/ or https://cwiki.apache.org/confluence/display/ARROW.)



Thanks,

Rhys

Re: valid NaNs versus invalid NaNs?

Posted by Donald Foss <do...@gmail.com>.
Alternately Rhys, what Wes said. :)

Donald E. Foss | @DonaldFoss <https://twitter.com/DonaldFoss>
Never Stop Learning!
------ __o
----_`\<,_
---(_)/ (_)



Re: valid NaNs versus invalid NaNs?

Posted by Donald Foss <do...@gmail.com>.
+1 on NaNs being an interop nightmare already, especially for those who work with multiple coding languages at the same time.

Issues regarding NaNs may be found at https://issues.apache.org/jira/browse/ARROW-2806?jql=text%20~%20%22NaN%22. The last issue I see was from July 2018, involved Python, and was marked resolved on 17 July 2018. Its description may be helpful.

Regards,

Donald E. Foss | @DonaldFoss <https://twitter.com/DonaldFoss>
Never Stop Learning!
------ __o
----_`\<,_
---(_)/ (_)



RE: valid NaNs versus invalid NaNs?

Posted by Rhys Ulerich <Rh...@twosigma.com>.
>> Anyhow, it seems worth addressing this gap at the written specification level.
> What would you suggest? We could add a statement to be explicit that no special / sentinel values (which includes NaN) are recognized as null.

I like your suggestion Wes.  Please consider making that amendment (or similar) in the next specification update.

Cheers,
Rhys

Re: valid NaNs versus invalid NaNs?

Posted by Wes McKinney <we...@gmail.com>.
hi Rhys,

On Mon, Dec 10, 2018 at 9:53 AM Rhys Ulerich <Rh...@twosigma.com> wrote:
>
> 'Morning,
>
>
>
> Regarding https://arrow.apache.org/docs/memory_layout.html, how should is_valid be interpreted for primitive types that have their own notions of is_valid?
>
>
>
> Concretely, how should folks interpret a "valid NaN" (is_valid 1 with float NaN) versus an "invalid NaN" (is_valid 0 with float NaN)?  In RFC-ese, MUST individual NaNs be valid?  Or, MUST floats all be valid by omitting the validity bitset?
>

In floating point types, NaN is a valid value. I think you're talking
about systems that use sentinel values to represent nulls. The Arrow
columnar format does not have any notion of sentinel values. So if you
want other Arrow systems to recognize your values as being null, then
you must construct the validity bitmap accordingly.
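
To make the distinction concrete, here is a minimal sketch in plain Python (standard library only, not pyarrow) of a values buffer paired with a validity bitmap. The helper names pack_validity, is_valid and classify are made up for illustration, and the bit order assumed is least-significant-bit first, as described in the memory layout document. A slot counts as null only when its validity bit is 0, whatever float happens to sit in the values buffer.

```python
import math


def pack_validity(flags):
    """Pack booleans into a bitmap, least-significant bit first.

    Slot j lives in byte j // 8 at bit position j % 8.
    """
    bitmap = bytearray((len(flags) + 7) // 8)
    for j, ok in enumerate(flags):
        if ok:
            bitmap[j // 8] |= 1 << (j % 8)
    return bytes(bitmap)


def is_valid(bitmap, j):
    """Return True when slot j is marked valid in the bitmap."""
    return bool(bitmap[j // 8] & (1 << (j % 8)))


def classify(values, bitmap, j):
    """Describe slot j: the validity bit is checked first, the float second."""
    if not is_valid(bitmap, j):
        return "null"
    if math.isnan(values[j]):
        return "valid NaN"
    return "value"


# Slots 1 and 2 hold the same float (NaN); only the validity bit differs.
values = [1.0, float("nan"), float("nan")]
validity = pack_validity([True, True, False])
kinds = [classify(values, validity, j) for j in range(len(values))]
print(kinds)
```

In this sketch slot 1 is a valid NaN and slot 2 is null, even though both slots carry the same bit pattern in the values buffer.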

>
>
> I ask because otherwise I can see a bunch of different systems interpreting this detail in many different ways.  That'd be an interop nightmare.  Especially since understanding why NaNs sneak into large datasets is already quite a hassle.
>

It is up to applications to determine what NaN means. It would not be
appropriate for Arrow to assume anything, particularly since most
database systems (AFAIK) distinguish NaN and NULL.

For example, in Python interop, we recognize NaN as null when
converting to Arrow, but _only_ if the data originated from pandas:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/type_traits.h#L102

In [1]: import pyarrow as pa

In [2]: import numpy as np

In [3]: arr = np.array([1, np.nan])

In [4]: arr1 = pa.array(arr)

In [5]: arr2 = pa.array(arr, from_pandas=True)

In [6]: arr1
Out[6]:
<pyarrow.lib.DoubleArray object at 0x7ffa3c8a1188>
[
  1,
  nan
]

In [7]: arr2
Out[7]:
<pyarrow.lib.DoubleArray object at 0x7ffa1ef42bd8>
[
  1,
  null
]

In [8]: arr1.null_count
Out[8]: 0

In [9]: arr2.null_count
Out[9]: 1
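
The from_pandas behavior in the session above amounts to an application-level policy applied at conversion time. A rough standard-library sketch of that idea follows; to_nullable and its nan_as_null flag are hypothetical names used only for illustration and do not reflect how pyarrow implements the conversion.

```python
import math


def to_nullable(values, nan_as_null=False):
    """Return (values, validity, null_count) for a list of floats.

    With nan_as_null=False every slot is valid, NaN included.
    With nan_as_null=True the caller opts in to treating NaN as a
    null sentinel, so those slots are marked invalid.
    """
    validity = [not (nan_as_null and math.isnan(v)) for v in values]
    return list(values), validity, validity.count(False)


data = [1.0, float("nan")]

# Default policy: NaN is an ordinary floating point value.
_, validity_plain, nulls_plain = to_nullable(data)

# Opt-in policy: the producer declares that NaN stands for a missing value.
_, validity_sentinel, nulls_sentinel = to_nullable(data, nan_as_null=True)

print(nulls_plain, nulls_sentinel)
```

The point of the flag is that the producer, not the format, decides whether NaN stands in for a missing value; once the validity flags are built, consumers only need to look at them.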

In R, NaN and NA are distinct:

https://github.com/apache/arrow/commit/3ab4a0f481211c5d115845519eb9398dc02e2e24#diff-4b43b0aee35624cd95b910189b3dc231

>
>
> Anyhow, it seems worth addressing this gap at the written specification level.
>

What would you suggest? We could add a statement to be explicit that
no special / sentinel values (which includes NaN) are recognized as
null.

- Wes
