Posted to user@arrow.apache.org by Spencer Nelson <sw...@uw.edu> on 2023/05/02 22:38:48 UTC

Python: Array.to_numpy(), nullable data, and masked arrays

What's the right way to convert Arrow arrays to numpy arrays in the
presence of nulls?

The first thing I reach for is array.to_numpy(zero_copy_only=False). But
this has some behaviors that I found a little undesirable.

For numeric data (or at least int64 and float64), nulls are converted to
floating point NaNs and the resulting numpy array is recast from integer to
floating point. For example:

>>> a = pa.array([1, 2, 3, None, 5])
>>> a
<pyarrow.lib.Int64Array object at 0x111b970a0>
[
  1,
  2,
  3,
  null,
  5
]
>>> a.to_numpy(False)
array([ 1.,  2.,  3., nan,  5.])

This can be problematic: *actual* floating point NaNs are mixed with nulls,
which is lossy:

>>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
array([ 1.,  2., nan, nan])

Boolean arrays get converted into 'object'-dtyped numpy arrays, with
'True', 'False', and 'None', which is a little undesirable as well.

One tool in numpy for dealing with nullable data is masked arrays (
https://numpy.org/doc/stable/reference/maskedarray.html) which work
somewhat like Arrow arrays' validity bitmap. I was thinking of writing some
code that generates a numpy masked array from an arrow array, but I'd need
to get the validity bitmap itself, and it doesn't seem to be accessible in
any pyarrow APIs. Am I missing it?

Or, am I thinking about this wrong, and there's some other way to pull
nullable data out of arrow and into numpy?

Thanks,
Spencer

Re: Python: Array.to_numpy(), nullable data, and masked arrays

Posted by Aldrin <oc...@pm.me>.
mmm, just to clarify, based on the initial message, `null_is_nan=True` would represent the current default behavior of `to_numpy`. By adding that as a flag, the existing behavior of `to_numpy` can be preserved when the function is modified (if desired; if not, then my whole recommendation is moot).

On the other hand, the `is_null` compute function defaults to `nan_is_null=False`, and if we can set the option for that function, then it's possible to drop all NaN values when calling `to_numpy`.

So, controlling both seems desirable, even if we want to capture that behavior in a single flag for usability (or differently named flags)



# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene



Re: Python: Array.to_numpy(), nullable data, and masked arrays

Posted by Aldrin <oc...@pm.me>.
orrr maybe you can add both `nan_is_null` and `null_is_nan`?

The `is_null` compute fn ([1]) takes `nan_is_null` as an option to either return true (null) for NaN values or return false (not null) for NaN values.

The opposite can be used by the `to_numpy` function to return nulls as masked (true) or as unmasked (false).

This would require documentation to specify the resolution order (compute fn resolves `nan_is_null` first, then conversion function resolves `null_is_nan` second). I think it'd probably be more usable to define a single flag that controls both options, but just throwing the possibility out there.

Either way, if you open an issue and submit a PR then the various approaches can be discussed there also.

The implementation of the `is_null` compute function in C++ can be found at [2], just for future reference (I wanted to check that there isn't any repetitive work if it's called from the `to_numpy` function).


[1]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html#pyarrow.compute.is_null

[2]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_validity.cc#LL105C1-L105C1






Re: Python: Array.to_numpy(), nullable data, and masked arrays

Posted by Aldrin <oc...@pm.me>.
cool!

> Is this something I should contribute back to pyarrow...

probably!

> ...as the default behavior... when presented with a fixed-width primitive list that has nulls

I am not sure about this. I would assume the use of maskedarray can be mostly hidden, so it's probably a good idea, but I would sometimes prefer something like that to be explicit, especially since it has different behavior as you mentioned before (e.g. mixes nulls with NaNs).

So, my preference would be to contribute it, but somehow using a flag (e.g. 'drop_nulls' or 'use_validity') or something.

Based on the way `to_numpy` is written ([1]), I think adding a flag and adding a condition after `ConvertArrayToPandas` is called seems like a reasonable approach.


[1]: https://github.com/apache/arrow/blob/main/python/pyarrow/array.pxi#L1527





Re: Python: Array.to_numpy(), nullable data, and masked arrays

Posted by Spencer Nelson <sw...@uw.edu>.
Thanks, both - this is helpful. pyarrow.compute.is_null is exactly what I
was looking for.

Masked arrays for fixed-width primitive types turn out to be reasonably
simple. I can call array.buffers() to get the underlying data buffer, and
use numpy.frombuffer on it. For the fixed-width primitives, it appears that
the memory layout is identical, so this works.

Then I can build the masked array with something like
`np.ma.masked_array(data_from_buffer, mask_from_is_null)` and it works
fine.

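The whole thing:
```
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def to_masked_array(array):
    # Fixed-width primitive arrays have two buffers: validity bitmap, then data.
    _, data_buf = array.buffers()
    # Arrow arrays expose .type (not .dtype); to_pandas_dtype() gives a numpy dtype here.
    dtype = np.dtype(array.type.to_pandas_dtype())
    # Honor array.offset so that sliced arrays are read from the right position.
    data = np.frombuffer(data_buf, dtype=dtype, count=len(array),
                         offset=array.offset * dtype.itemsize)
    # is_null returns an Arrow BooleanArray; convert it to a numpy bool mask.
    mask = pc.is_null(array).to_numpy(zero_copy_only=False)
    return np.ma.masked_array(data, mask)
```

"array.type.to_pandas_dtype()" is a bit odd, there. There's a
pyarrow.from_numpy_dtype, but no pyarrow.to_numpy_dtype to go the other
way. to_pandas_dtype seems to work despite the name, though.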

I don't think this could be made very simple for variable-length primitives
or complex arrow types, but I can live with that.

I believe this whole thing works with zero copy. Is this something I should
contribute back to pyarrow as the default behavior of to_numpy() when
presented with a fixed-width primitive list that has nulls?

On Tue, May 2, 2023 at 5:09 PM Steve Kim <ch...@gmail.com> wrote:

> Adding to Aldrin's very informative answer: the pyarrow.compute.is_null
> function (
> https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
> returns a boolean array that can be converted to a mask for
> numpy.ma.MaskedArray
>
> On Tue, May 2, 2023, 18:26 Aldrin <oc...@pm.me> wrote:
>
>> I think per [1] and [2], because your data has null values, there is no
>> good and supported approach to a zero-copy conversion to pandas or numpy.
>> So, I think [3] to drop nulls, then use to_numpy() is the path of least
>> resistance.
>>
>> If you want to try and do the masked array approach, you need to go from:
>> (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the Buffer as
>> appropriate.
>>
>> For (1), see [4]. For (2), see [5]. Then, [6] explains that for a
>> fixed-width primitive data type, the first buffer is the validity bitmap. I
>> am not sure that floats are fixed width, but I think they are. I know that
>> Decimal types are a binary format.
>>
>> I think [7] will be helpful to see how the validity bitmap is used in
>> C++, not sure how familiar you are, but I'm not sure how far down the
>> rabbit hole you'd have to go to use the validity bitmap from python.
>>
>>
>> [1]:
>> https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
>> <https://urldefense.com/v3/__https://arrow.apache.org/docs/python/pandas.html*zero-copy-series-conversions__;Iw!!K-Hz7m0Vt54!iwQjq6LF_NZBUUk-csVzlhzaStm04INmOSpDsekslZV5LMEqzkOamm-GfaNcZjC1ljF4koAU5Lq2L6-B$>
>> [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
>> <https://urldefense.com/v3/__https://arrow.apache.org/docs/python/numpy.html*arrow-to-numpy__;Iw!!K-Hz7m0Vt54!iwQjq6LF_NZBUUk-csVzlhzaStm04INmOSpDsekslZV5LMEqzkOamm-GfaNcZjC1ljF4koAU5JMTtTZb$>
>> [3]:
>> https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
>> <https://urldefense.com/v3/__https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html*pyarrow.compute.drop_null__;Iw!!K-Hz7m0Vt54!iwQjq6LF_NZBUUk-csVzlhzaStm04INmOSpDsekslZV5LMEqzkOamm-GfaNcZjC1ljF4koAU5IYUQ_RH$>
>> [4]:
>> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
>> <https://urldefense.com/v3/__https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd*L219__;Iw!!K-Hz7m0Vt54!iwQjq6LF_NZBUUk-csVzlhzaStm04INmOSpDsekslZV5LMEqzkOamm-GfaNcZjC1ljF4koAU5CYYldrV$>
>> [5]:
>> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
>> <https://urldefense.com/v3/__https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd*L173__;Iw!!K-Hz7m0Vt54!iwQjq6LF_NZBUUk-csVzlhzaStm04INmOSpDsekslZV5LMEqzkOamm-GfaNcZjC1ljF4koAU5K5BOJl4$>
>> [6]:
>> https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
>> [7]:
>> https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
>>
>>
>> # ------------------------------
>> # Aldrin
>>
>> https://github.com/drin/
>> https://gitlab.com/octalene
>>
>> Sent with Proton Mail
>> secure email.
>>
>> ------- Original Message -------
>> On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <sw...@uw.edu>
>> wrote:
>>
>> What's the right way to convert Arrow arrays to numpy arrays in the
>> presence of nulls?
>>
>> The first thing I reach for is array.to_numpy(zero_copy_only=False). But
>> this has some behaviors that I found a little undesirable.
>>
>> For numeric data (or at least int64 and float64), nulls are converted to
>> floating point NaNs and the resulting numpy array is recast from integer to
>> floating point. For example:
>>
>> >>> a = pa.array([1, 2, 3, None, 5])
>> >>> a
>> <pyarrow.lib.Int64Array object at 0x111b970a0>
>> [
>> 1,
>> 2,
>> 3,
>> null,
>> 5
>> ]
>> >>> a.to_numpy(False)
>> array([ 1., 2., 3., nan, 5.])
>>
>> This can be problematic: *actual* floating point NaNs are mixed with
>> nulls, which is lossy:
>>
>> >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
>> array([ 1., 2., nan, nan])
>>
>> Boolean arrays get converted into 'object'-dtyped numpy arrays, with
>> 'True', 'False', and 'None', which is a little undesirable as well.
>>
>> One tool in numpy for dealing with nullable data is masked arrays (
>> https://numpy.org/doc/stable/reference/maskedarray.html)
>> which work somewhat like Arrow arrays' validity bitmap. I was thinking of
>> writing some code that generates a numpy masked array from an arrow array,
>> but I'd need to get the validity bitmap itself, and it doesn't seem to be
>> accessible in any pyarrow APIs. Am I missing it?
>>
>> Or, am I thinking about this wrong, and there's some other way to pull
>> nullable data out of arrow and into numpy?
>>
>> Thanks,
>> Spencer
>>
>>
>>
>>

Re: Python: Array.to_numpy(), nullable data, and masked arrays

Posted by Steve Kim <ch...@gmail.com>.
Adding to Aldrin's very informative answer: the pyarrow.compute.is_null
function (
https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
returns a boolean array that can be converted to a numpy boolean array
and used as the mask for numpy.ma.MaskedArray.

On Tue, May 2, 2023, 18:26 Aldrin <oc...@pm.me> wrote:

> I think per [1] and [2], because your data has null values, there is no
> good and supported approach to a zero-copy conversion to pandas or numpy.
> So, I think [3] to drop nulls, then use to_numpy() is the path of least
> resistance.
>
> If you want to try and do the masked array approach, you need to go from:
> (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the Buffer as
> appropriate.
>
> For (1), see [4]. For (2), see [5]. Then, [6] explains that for a
> fixed-width primitive data type, the first buffer is the validity bitmap. I
> am not sure that floats are fixed width, but I think they are. I know that
> Decimal types are a binary format.
>
> I think [7] will be helpful to see how the validity bitmap is used in C++,
> not sure how familiar you are, but I'm not sure how far down the rabbit
> hole you'd have to go to use the validity bitmap from python.
>
>
> [1]:
> https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
> [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
> [3]:
> https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
> [4]:
> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
> [5]:
> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
> [6]:
> https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
> [7]:
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
>
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
>
> Sent with Proton Mail <https://proton.me/> secure email.
>
> ------- Original Message -------
> On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <sw...@uw.edu>
> wrote:
>
> What's the right way to convert Arrow arrays to numpy arrays in the
> presence of nulls?
>
> The first thing I reach for is array.to_numpy(zero_copy_only=False). But
> this has some behaviors that I found a little undesirable.
>
> For numeric data (or at least int64 and float64), nulls are converted to
> floating point NaNs and the resulting numpy array is recast from integer to
> floating point. For example:
>
> >>> a = pa.array([1, 2, 3, None, 5])
> >>> a
> <pyarrow.lib.Int64Array object at 0x111b970a0>
> [
> 1,
> 2,
> 3,
> null,
> 5
> ]
> >>> a.to_numpy(False)
> array([ 1., 2., 3., nan, 5.])
>
> This can be problematic: *actual* floating point NaNs are mixed with
> nulls, which is lossy:
>
> >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> array([ 1., 2., nan, nan])
>
> Boolean arrays get converted into 'object'-dtyped numpy arrays, with
> 'True', 'False', and 'None', which is a little undesirable as well.
>
> One tool in numpy for dealing with nullable data is masked arrays (
> https://numpy.org/doc/stable/reference/maskedarray.html) which work
> somewhat like Arrow arrays' validity bitmap. I was thinking of writing some
> code that generates a numpy masked array from an arrow array, but I'd need
> to get the validity bitmap itself, and it doesn't seem to be accessible in
> any pyarrow APIs. Am I missing it?
>
> Or, am I thinking about this wrong, and there's some other way to pull
> nullable data out of arrow and into numpy?
>
> Thanks,
> Spencer
>
>
>
>

Re: Python: Array.to_numpy(), nullable data, and masked arrays

Posted by Aldrin <oc...@pm.me>.
I think per [1] and [2], because your data has null values, there is no good and supported approach to a zero-copy conversion to pandas or numpy. So, I think [3] to drop nulls, then use to_numpy() is the path of least resistance.


If you want to try and do the masked array approach, you need to go from: (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the Buffer as appropriate.


For (1), see [4]. For (2), see [5]. Then, [6] explains that for a fixed-width primitive data type, the first buffer is the validity bitmap. I am not sure that floats are fixed width, but I think they are. I know that Decimal types are a binary format.


I think [7] will be helpful for seeing how the validity bitmap is used in C++. I'm not sure how familiar you are with it, or how far down the rabbit hole you'd have to go to use the validity bitmap from Python.





[1]: https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions

[2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy

[3]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null

[4]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219

[5]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173

[6]: https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout

[7]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102




# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene


Sent with Proton Mail secure email.

------- Original Message -------
On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <sw...@uw.edu> wrote:


> What's the right way to convert Arrow arrays to numpy arrays in the presence of nulls?
> The first thing I reach for is array.to_numpy(zero_copy_only=False). But this has some behaviors that I found a little undesirable.
> 

> For numeric data (or at least int64 and float64), nulls are converted to floating point NaNs and the resulting numpy array is recast from integer to floating point. For example:
> 

> >>> a = pa.array([1, 2, 3, None, 5])
> >>> a
> <pyarrow.lib.Int64Array object at 0x111b970a0>
> [
> 1,
> 2,
> 3,
> null,
> 5
> ]
> >>> a.to_numpy(False)
> array([ 1., 2., 3., nan, 5.])
> This can be problematic: actual floating point NaNs are mixed with nulls, which is lossy:
> 

> >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> array([ 1., 2., nan, nan])
> 

> Boolean arrays get converted into 'object'-dtyped numpy arrays, with 'True', 'False', and 'None', which is a little undesirable as well.
> 

> One tool in numpy for dealing with nullable data is masked arrays (https://numpy.org/doc/stable/reference/maskedarray.html) which work somewhat like Arrow arrays' validity bitmap. I was thinking of writing some code that generates a numpy masked array from an arrow array, but I'd need to get the validity bitmap itself, and it doesn't seem to be accessible in any pyarrow APIs. Am I missing it?
> 

> Or, am I thinking about this wrong, and there's some other way to pull nullable data out of arrow and into numpy?
> 

> Thanks,
> Spencer
> 

>