Posted to dev@arrow.apache.org by Li Jin <ic...@gmail.com> on 2021/06/08 19:59:26 UTC

Representation of "null" values for non-numeric types in Arrow/Pandas interop

Hello!

Apologies if this has been brought up before. I'd like to get the devs'
thoughts on a potential inconsistency between pandas and pyarrow in which
Python objects are used to represent null values.

Demonstrated with the following example:

(1) pandas seems to use np.NaN to represent a missing value (with pandas
1.2.4):

In [32]: df
Out[32]:
           value
key
1    some_strign

In [33]: df2
Out[33]:
                value2
key
2    some_other_string

In [34]: df.join(df2)
Out[34]:
           value value2
key
1    some_strign    NaN



(2) pyarrow seems to use None to represent a missing value (with pyarrow 4.0.1):

>>> s = pd.Series(["some_string", np.NaN])
>>> s
0    some_string
1            NaN
dtype: object
>>> pa.Array.from_pandas(s).to_pandas()
0    some_string
1           None
dtype: object


I have looked around the pyarrow docs and didn't find an option to use
np.NaN for null values with to_pandas, so it's a bit hard to achieve
round-trip consistency.


I appreciate any thoughts on this as to how to achieve consistency here.


Thanks!

Li

Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Semantically, NaN is defined by the IEEE 754 standard for floating-point
numbers, while null represents a value that is undefined, unknown, or
missing.

One important problem that Arrow solves is having a native representation
for null values (independent of NaN): Arrow's in-memory model is designed
from the ground up to support nulls. Other in-memory representations
sometimes use NaN or other sentinel values to represent nulls, which can
break memory layouts that are useful for compute.

In Arrow, a value in a floating-point array can be "non-null" or "null".
When non-null, it can be any valid value for the corresponding type; for
floats, that means any valid floating-point number, including NaN, inf,
-0.0, 0.0, etc.

Best,
Jorge



Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

Posted by Wes McKinney <we...@gmail.com>.
To my knowledge, "None" has always been the preferred null sentinel
value for object-dtype arrays in pandas, but since sometimes these
arrays originate from transposes or other join/append operations that
merge numeric arrays (which have NaN sentinels) into non-numeric
arrays to create object arrays, we were forced to deal with multiple
possible sentinel values.

All of this is a bit of an unfortunate artifact of pandas's use of
sentinel values and permissiveness around mixed-type arrays, and one
of the motivations I had for helping build the Arrow project in the
first place: to be the data structure and computing platform that I
wish had existed more than a decade ago.
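
To illustrate how such mixed sentinels arise, a small sketch (the frame and
column names here are made up):

```python
import pandas as pd

# Joining on non-overlapping indexes forces NaN (a float) into an
# object-dtype string column, even though None is the usual null
# sentinel for object arrays.
left = pd.DataFrame({"value": ["some_string"]}, index=[1])
right = pd.DataFrame({"value2": ["some_other_string"]}, index=[2])
joined = left.join(right)
missing = joined["value2"].iloc[0]
print(type(missing))     # <class 'float'> -- NaN, not None
print(pd.isna(missing))  # True
```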


Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

Posted by Joris Van den Bossche <jo...@gmail.com>.
That won't help in this specific case, since it is for an array of
strings (which you can't fill with NaN), and for floating point
arrays, we already use np.nan as "null" representation when converting
to numpy/pandas.


Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

Posted by Benjamin Kietzman <be...@gmail.com>.
As a workaround, the "fill_null" compute function can be used to replace
nulls with nans:

>>> nan = pa.scalar(np.NaN, type=pa.float64())
>>> pa.Array.from_pandas(s).fill_null(nan).to_pandas()


Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Li,

It's correct that arrow uses "None" for null values when converting a
string array to numpy / pandas.
As far as I am aware, there is currently no option to control that
(and to make it use np.nan instead), and I am not sure there would be
much interest in adding such an option.

Now, I know this doesn't give an exact roundtrip in this case, but pandas
treats both np.nan and None as missing values in object-dtype columns, so
behaviour-wise this shouldn't make any difference and the roundtrip is
still faithful in that respect.
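
A quick check of that equivalence on the pandas side (a minimal sketch):

```python
import numpy as np
import pandas as pd

# In object dtype, pandas treats None and np.nan identically as missing.
s_nan = pd.Series(["some_string", np.nan], dtype=object)
s_none = pd.Series(["some_string", None], dtype=object)
print(s_nan.isna().tolist())    # [False, True]
print(s_none.isna().tolist())   # [False, True]
print(s_nan.fillna("x").equals(s_none.fillna("x")))  # True
```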

Best,
Joris
