Posted to user@arrow.apache.org by Marc Garcia <ga...@gmail.com> on 2020/09/12 19:16:18 UTC

Implementation independent __arrow_array__

Hi there,

I'm writing a document analyzing different options for a Python dataframe
exchange protocol. And I wanted to ask a question regarding the
__arrow_array__ protocol.

I checked the code, and it looks like the producer is expected to send
an Arrow array, and the consumer just receives it. This is the code I'm
looking at; I guess it's the right one:
https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L110

Compared to the array interface (the NumPy buffer protocol), it works a bit
differently. In the NumPy one, the producer exposes the pointer, the
size, and so on. So the producer doesn't need to depend on NumPy or any other
library, and the consumer can simply call `numpy.array(obj)` to
generate the NumPy array. Or, if other implementations support the protocol
(not sure if they do), they could call something like
`tensorflow.Tensor(obj)`, and NumPy would not be used at all.
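
For illustration, a minimal sketch of the producer side of the NumPy array
interface described above. The class name `PlainProducer` is hypothetical, and
the standard-library `array` module is an assumption chosen only so that the
producer owns a memory block without touching NumPy; NumPy is used on the
consumer side only.

```python
import array
import sys

import numpy as np  # needed by the consumer code at the bottom only


class PlainProducer:
    """Hypothetical producer that owns memory without using NumPy."""

    def __init__(self, values):
        # array.array stores C doubles in one contiguous block.
        self._storage = array.array("d", values)

    @property
    def __array_interface__(self):
        # buffer_info() returns (memory address, number of elements).
        address, length = self._storage.buffer_info()
        byte_order = "<" if sys.byteorder == "little" else ">"
        return {
            "version": 3,
            "shape": (length,),
            "typestr": f"{byte_order}f{self._storage.itemsize}",
            "data": (address, False),  # (pointer, read-only flag)
        }


producer = PlainProducer([1.0, 2.0, 3.0])

# Consumer side: NumPy reads the pointer, shape, and type from the dict.
consumed = np.asarray(producer)
```

In this sketch the resulting array is a view on the producer's memory, so the
producer has to keep that memory alive and unmoved while the view exists.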

Am I understanding the `__arrow_array__` protocol correctly? And if I am,
is there anything else similar to the NumPy protocol that can be used to
exchange data without relying on a particular implementation?

Thanks in advance!

Re: Implementation independent __arrow_array__

Posted by Joris Van den Bossche <jo...@gmail.com>.
In addition to Wes' reference to the Arrow C data interface, I think it is
also important to clarify some aspects.

In numpy, you have the "array interface" (`__array_interface__` property)
and the "array dunder method" (`__array__` method). When people speak about
the array protocol, I think they typically mean the first (although this can
easily be confusing), and this is what exposes the actual memory
buffer (generalized by the Python buffer protocol). But in practice, many
custom array-like containers (e.g. pandas, xarray, ...) actually implement the
second option to ensure numpy knows how to convert the container to a
numpy array and operate on it.

The __array__ method also requires that an actual numpy.ndarray is
returned (this can be tested with a small example, or inferred from the code
<https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L2107-L2151>).
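
One possible version of such a small example, as a sketch: both class names
are hypothetical, and the `dtype=None, copy=None` signature is an assumption
made so that the method accepts the keyword arguments that different NumPy
versions may pass. The exact exception type is not assumed here, so the
sketch catches both `TypeError` and `ValueError`.

```python
import numpy as np


class ReturnsNdarray:
    """Hypothetical container whose __array__ returns a real ndarray."""

    def __array__(self, dtype=None, copy=None):
        return np.array([1, 2, 3], dtype=dtype)


class ReturnsList:
    """Hypothetical container whose __array__ returns a plain list."""

    def __array__(self, dtype=None, copy=None):
        return [1, 2, 3]  # not a numpy.ndarray


# Accepted: the method hands back an actual ndarray.
converted = np.asarray(ReturnsNdarray())

# Rejected: NumPy refuses a return value that is not an ndarray.
try:
    np.asarray(ReturnsList())
    rejected = False
except (TypeError, ValueError) as exc:
    rejected = True
    message = str(exc)
```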

So the __arrow_array__ method should be compared with numpy's
__array__ method rather than the __array_interface__ property, and it
works exactly the same as the __array__ method regarding the
return type. Then, for an equivalent of numpy's __array_interface__ (or
more generally the Python buffer protocol), it's indeed correct to point
to the Arrow C data interface.

Maybe it could make sense at some point to add an
"__arrow_array_interface__" dunder method to make it easier to expose this
from Python. But I am not very familiar with the details of how this could
work (currently a specific C struct is expected, not a Python dict like
the numpy array interface).

Joris

On Sat, 12 Sep 2020 at 22:21, Wes McKinney <we...@gmail.com> wrote:

> Adding dev@
>
> This is one purpose of the Arrow C data interface, which was developed
> after the __arrow_array__ protocol and is worth investigating:
>
>
> https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst
>
> On Sat, Sep 12, 2020 at 2:16 PM Marc Garcia <ga...@gmail.com> wrote:
> >
> > Hi there,
> >
> > I'm writing a document analyzing different options for a Python
> > dataframe exchange protocol. And I wanted to ask a question regarding the
> > __arrow_array__ protocol.
> >
> > I checked the code, and it looks like the producer is expected to
> > send an Arrow array, and the consumer just receives it. This is the code
> > I'm looking at; I guess it's the right one:
> > https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L110
> >
> > Compared to the array interface (the NumPy buffer protocol), it works a
> > bit differently. In the NumPy one, the producer exposes the pointer, the
> > size, and so on. So the producer doesn't need to depend on NumPy or any other
> > library, and the consumer can simply call `numpy.array(obj)` to
> > generate the NumPy array. Or, if other implementations support the protocol
> > (not sure if they do), they could call something like
> > `tensorflow.Tensor(obj)`, and NumPy would not be used at all.
> >
> > Am I understanding the `__arrow_array__` protocol correctly? And if I
> > am, is there anything else similar to the NumPy protocol that can be used
> > to exchange data without relying on a particular implementation?
> >
> > Thanks in advance!
>

Re: Implementation independent __arrow_array__

Posted by Wes McKinney <we...@gmail.com>.
Adding dev@

This is one purpose of the Arrow C data interface, which was developed
after the __arrow_array__ protocol and is worth investigating:

https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst

On Sat, Sep 12, 2020 at 2:16 PM Marc Garcia <ga...@gmail.com> wrote:
>
> Hi there,
>
> I'm writing a document analyzing different options for a Python dataframe exchange protocol. And I wanted to ask a question regarding the __arrow_array__ protocol.
>
> I checked the code, and it looks like the producer is expected to send an Arrow array, and the consumer just receives it. This is the code I'm looking at; I guess it's the right one: https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L110
>
> Compared to the array interface (the NumPy buffer protocol), it works a bit differently. In the NumPy one, the producer exposes the pointer, the size, and so on. So the producer doesn't need to depend on NumPy or any other library, and the consumer can simply call `numpy.array(obj)` to generate the NumPy array. Or, if other implementations support the protocol (not sure if they do), they could call something like `tensorflow.Tensor(obj)`, and NumPy would not be used at all.
>
> Am I understanding the `__arrow_array__` protocol correctly? And if I am, is there anything else similar to the NumPy protocol that can be used to exchange data without relying on a particular implementation?
>
> Thanks in advance!
