Posted to github@arrow.apache.org by "paleolimbot (via GitHub)" <gi...@apache.org> on 2024/03/19 01:25:34 UTC

Re: [I] [Python] Conventions around PyCapsule Interface and choosing Array/Stream export [arrow]

paleolimbot commented on issue #40648:
URL: https://github.com/apache/arrow/issues/40648#issuecomment-2005564663

   > nanoarrow implements both `__arrow_c_array__` and `__arrow_c_stream__`
   
   For reference, the PR implementing `nanoarrow.Array` is https://github.com/apache/arrow-nanoarrow/pull/396 . It is basically a ChunkedArray and is currently the only planned user-facing Array-ish thing, although it's all very new (feel free to comment on that PR!). Basically, I found that maintaining both a chunked and a non-chunked pathway in geoarrow-pyarrow resulted in a lot of Python loops over chunks, and I wanted to avoid forcing nanoarrow users to maintain two pathways. Many pyarrow methods might give you back an `Array` or a `ChunkedArray`; however, many `ChunkedArray`s only have one chunk. The whole thing is imperfect and a bit of a compromise.
   
   > Fundamentally, my question is whether the existence of methods on an object should allow for an inference of its storage type
   
   My take on this is that as long as the object has an unambiguous interpretation as a contiguous array (or *might* have one, since it might take a loop over something that is not already Arrow data to figure this out), I think it's fine for `__arrow_c_array__` to exist. As long as an object has an unambiguous interpretation as zero or more arrays (or *might* have one), I think `__arrow_c_stream__` can exist. I don't see those as mutually exclusive; for me this is like `pyarrow.array()` returning either a `ChunkedArray` or an `Array`: it just doesn't know until it sees the input what type it needs to unambiguously represent it.
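   To make the "not mutually exclusive" point concrete, here is a structural sketch of a chunked container that truthfully implements both protocols. This is illustrative only: the class name is hypothetical, and the capsule creation is stubbed out with plain Python objects so the shape of the protocol is visible. Real implementations return PyCapsules wrapping `ArrowSchema`/`ArrowArray`/`ArrowArrayStream` C structs (e.g. via pyarrow or nanoarrow).

```python
# Structural sketch (not real Arrow code): placeholder tuples stand in for the
# PyCapsules that real __arrow_c_array__ / __arrow_c_stream__ would return.

class ChunkedThing:
    """A hypothetical chunked container that can export itself both ways."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    def __arrow_c_stream__(self, requested_schema=None):
        # Always unambiguous: zero or more arrays, yielded chunk by chunk.
        return ("stream-capsule", self.chunks)  # placeholder, not a PyCapsule

    def __arrow_c_array__(self, requested_schema=None):
        # Also unambiguous as a single contiguous array -- but it may cost a
        # concatenation up front that the stream export would not.
        flat = [x for chunk in self.chunks for x in chunk]
        return ("schema-capsule", flat)  # placeholder, not a capsule pair
```

   The point is that both methods can honestly exist on the same object; the consumer's choice of which one to call encodes whether it can handle chunked input or needs contiguous memory (and is willing to pay for the copy).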
   
   For something like an `Array` or `RecordBatch` (or a `numpy` array) that is definitely Arrow and definitely contiguous, I am not sure what the benefit would be for `__arrow_c_stream__` to exist, and it is probably just confusing if it does.
   
   There are other assumptions that can't be captured by the mere existence of either of those methods, like exactly how expensive it will be to call any one of them. In https://github.com/shapely/shapely/pull/1953 both are fairly expensive because the data are not Arrow yet. For a database driver, it might be expensive to consume the stream because the data haven't arrived over the network yet.
   
   The Python buffer protocol has a `flags` field to handle consumer requests along these lines (like a request for contiguous, rather than strided, memory) that could be used to disambiguate some of these cases if it turns out that disambiguating them is important. It is also careful to note that the existence of the buffer protocol implementation does not imply that attempting to get the buffer will succeed.
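   The buffer-protocol analogy is visible from pure Python via `memoryview`, whose contiguity properties mirror the C-level `PyBUF_*` request flags. A consumer that needs contiguous memory can check before forcing a copy (this is standard-library behavior, not part of the Arrow protocol):

```python
# Standard-library illustration of the buffer-protocol point: the C-level
# request flags (e.g. PyBUF_C_CONTIGUOUS) surface in Python as memoryview
# properties a consumer can inspect.
buf = memoryview(bytearray(b"abcdef"))
assert buf.c_contiguous          # a bytearray's buffer is contiguous

strided = buf[::2]               # slicing with a step produces a strided view
assert not strided.c_contiguous  # a contiguity request would require a copy

# A consumer that requires contiguous memory can materialize a copy explicitly:
flat = strided.tobytes()
assert flat == b"ace"
```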
   
   For consuming in nanoarrow, the current approach is to use `__arrow_c_stream__` whenever possible since this has the fewest constraints (arrays need not be in memory yet, need not be contiguous, might not be fully consumed). Then it falls back on `__arrow_c_array__`. The entrypoint is `nanoarrow.c_array_stream()`, which will happily accept either (generates a length-one stream if needed).
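   The stream-first consumption order described above can be sketched as a small dispatch helper. This is an illustrative reimplementation of the *idea* behind `nanoarrow.c_array_stream()`, not its actual code: capsule unwrapping is stubbed out, and the toy producer classes at the end exist only for demonstration.

```python
# Illustrative sketch of stream-first consumption. Not nanoarrow's actual
# implementation: real code would unwrap PyCapsules; here the "stream" is
# just whatever the dunder returns.

def c_array_stream(obj, requested_schema=None):
    """Prefer the stream protocol; fall back to wrapping a single array."""
    if hasattr(obj, "__arrow_c_stream__"):
        # Fewest constraints: data may be lazy, chunked, or partially consumed.
        return obj.__arrow_c_stream__(requested_schema)
    if hasattr(obj, "__arrow_c_array__"):
        # Wrap the one contiguous array as a length-one stream.
        schema, array = obj.__arrow_c_array__(requested_schema)
        return iter([(schema, array)])
    raise TypeError(f"{type(obj).__name__!r} does not export Arrow data")


# Toy producers (hypothetical, for demonstration only):
class StreamOnly:
    def __arrow_c_stream__(self, requested_schema=None):
        return "stream"

class ArrayOnly:
    def __arrow_c_array__(self, requested_schema=None):
        return ("schema", "array")

class Both(StreamOnly, ArrayOnly):
    pass
```

   Note that an object implementing both protocols (like `Both` above) is consumed via its stream export, matching the "fewest constraints first" preference.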


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org