Posted to issues@arrow.apache.org by "kylebarron (via GitHub)" <gi...@apache.org> on 2024/03/18 21:07:02 UTC

[I] [Python] Conventions around PyCapsule Interface and choosing Array/Stream export [arrow]

kylebarron opened a new issue, #40648:
URL: https://github.com/apache/arrow/issues/40648

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   👋 I've been excited about the PyCapsule interface, and have been implementing it in my [geoarrow-rust project](https://github.com/geoarrow/geoarrow-rs). Every function call accepts any Arrow PyCapsule interface object, no matter its producer. It's really amazing!
   
   Fundamentally, my question is whether the existence of methods on an object should allow for an inference of its storage type. That is, should it be possible to observe whether a producer object is chunked or not based on whether it exports `__arrow_c_array__` or `__arrow_c_stream__`? I had been expecting yes, but this question came up [here](https://github.com/shapely/shapely/pull/1953#discussion_r1529247534), where nanoarrow implements _both_ `__arrow_c_array__` and `__arrow_c_stream__`.
   
   
   I'd argue that it's simpler to define only a single type of export method on a class and allow consumers to convert to a different representation if they need to.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] Conventions around PyCapsule Interface and choosing Array/Stream export [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #40648:
URL: https://github.com/apache/arrow/issues/40648#issuecomment-2005564663

   > nanoarrow implements both `__arrow_c_array__` and `__arrow_c_stream__`
   
   For reference, the PR implementing `nanoarrow.Array` is https://github.com/apache/arrow-nanoarrow/pull/396 . It is basically a ChunkedArray and is currently the only planned user-facing Arrayish thing, although it's all very new (feel free to comment on that PR!). In short, I found that maintaining both a chunked and a non-chunked pathway in geoarrow-pyarrow resulted in a lot of Python loops over chunks, and I wanted to avoid forcing nanoarrow users to maintain two pathways. Many pyarrow methods might give you back an `Array` or a `ChunkedArray`; however, many `ChunkedArray`s only have one chunk. The whole thing is imperfect and a bit of a compromise.
   
   > Fundamentally, my question is whether the existence of methods on an object should allow for an inference of its storage type
   
   My take on this is that as long as the object has an unambiguous interpretation as a contiguous array (or *might* have one, since it might take a loop over something that is not already Arrow data to figure this out), I think it's fine for `__arrow_c_array__` to exist.  As long as an object has an unambiguous interpretation as zero or more arrays (or *might* have one), I think `__arrow_c_stream__` can exist. I don't see those as mutually exclusive...for me this is like `pyarrow.array()` returning either a `ChunkedArray` or an `Array`: it just doesn't know until it sees the input what type it needs to unambiguously represent it.
   
   For something like an `Array` or `RecordBatch` (or something like a `numpy` array) that is definitely Arrow and is definitely contiguous, I am not sure what the benefit would be for `__arrow_c_stream__` to exist and it is probably just confusing if it does.
   
   There are other assumptions that can't be captured by the mere existence of either of those methods, like exactly how expensive it will be to call any one of them. In https://github.com/shapely/shapely/pull/1953 both are fairly expensive because the data are not Arrow yet. For a database driver, it might be expensive to consume the stream because the data haven't arrived over the network yet.
   
   The Python buffer protocol has a `flags` field to handle consumer requests along these lines (like a request for contiguous, rather than strided, memory) that could be used to disambiguate some of these cases if it turns out that disambiguating them is important. It is also careful to note that the existence of the buffer protocol implementation does not imply that attempting to get the buffer will succeed.
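
   The C-level `flags` negotiation isn't exposed in pure Python, but the same contiguity concern is visible through `memoryview` (a small illustrative sketch, not the flags API itself):

```python
# A consumer can observe whether a buffer is contiguous and copy when it
# needs contiguous memory, analogous to requesting PyBUF_CONTIG via flags.
data = memoryview(bytes(range(10)))
strided = data[::2]          # a strided (non-contiguous) view
assert data.contiguous
assert not strided.contiguous
contig = strided.tobytes()   # copy into contiguous memory
assert list(contig) == [0, 2, 4, 6, 8]
```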
   
   For consuming in nanoarrow, the current approach is to use `__arrow_c_stream__` whenever possible since this has the fewest constraints (arrays need not be in memory yet, need not be contiguous, might not be fully consumed). Then it falls back on `__arrow_c_array__`. The entrypoint is `nanoarrow.c_array_stream()`, which will happily accept either (generates a length-one stream if needed).




Re: [I] [Python] Conventions around PyCapsule Interface and choosing Array/Stream export [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on issue #40648:
URL: https://github.com/apache/arrow/issues/40648#issuecomment-2008587961

   > for me this is like `pyarrow.array()` returning either a `ChunkedArray` or an `Array`: it just doesn't know until it sees the input what type it needs to unambiguously represent it.
   
   Does `pyarrow.array` ever return a `ChunkedArray`? I tried to pass in a list of lists and it inferred a `ListArray`. I thought `pyarrow.array` only ever returned an `Array` and `pyarrow.chunked_array` only ever returned a `ChunkedArray`?
   
   > For something like an `Array` or `RecordBatch` (or something like a `numpy` array) that is definitely Arrow and is definitely contiguous, I am not sure what the benefit would be for `__arrow_c_stream__` to exist and it is probably just confusing if it does.
   
   So your argument is that `Array` should never have `__arrow_c_stream__`, but that `ChunkedArray` should have both `__arrow_c_array__` and `__arrow_c_stream__`?




Re: [I] [Python] Conventions around PyCapsule Interface and choosing Array/Stream export [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #40648:
URL: https://github.com/apache/arrow/issues/40648#issuecomment-2008617788

   > Does pyarrow.array ever return a ChunkedArray?
   
   ```python
   import pyarrow as pa
   
   type(pa.array(["a" * 2 ** 20 for _ in range(2**10)]))
   #> pyarrow.lib.StringArray
   type(pa.array(["a" * 2 ** 20 for _ in range(2**11)]))
   #> pyarrow.lib.ChunkedArray
   ```
   
   This is also true of the [`__arrow_array__` protocol](https://arrow.apache.org/docs/python/extending_types.html#controlling-conversion-to-pyarrow-array-with-the-arrow-array-protocol).
   
   > I'm not sure which overload a type checker would pick if the input object had both dunder methods.
   
   Would the typing hints be significantly different for a `ChunkedPointArray` vs a `PointArray`?
   
   > So your argument is that Array should never have __arrow_c_stream__, but that ChunkedArray should have both __arrow_c_array__ and __arrow_c_stream__?
   
   Maybe *could* is more like it. pyarrow has an `Array` class for a specifically contiguous array...nanoarrow doesn't at the moment (at least in a user-facing nicely typed sort of way).




Re: [I] [Python] Conventions around PyCapsule Interface and choosing Array/Stream export [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on issue #40648:
URL: https://github.com/apache/arrow/issues/40648#issuecomment-2008583634

   Being able to infer the input structure also significantly helps static typing. For example, I have type hints that I'm writing for geoarrow-rust that [include](https://github.com/geoarrow/geoarrow-rs/blob/cfe91bbe15c20b8fd814f7e93294f9c134b9d87a/python/core/python/geoarrow/rust/core/_rust.pyi#L1199-L1205):
   
   ```py
   @overload
   def centroid(input: ArrowArrayExportable) -> PointArray: ...
   @overload
   def centroid(input: ArrowStreamExportable) -> ChunkedPointArray: ...
   def centroid(
       input: ArrowArrayExportable | ArrowStreamExportable,
   ) -> PointArray | ChunkedPointArray: ...
   ```
   
   I'm not sure which overload a type checker would pick if the input object had _both_ dunder methods. I suppose it would always return the union. But being able to use structural types in this way is quite useful for static type checking and IDE autocompletion, which are really sore spots right now with pyarrow.

