Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/03 08:06:27 UTC

[GitHub] [arrow] madhavajay opened a new issue #12553: Support for Compute Functions on Nested Arrays

madhavajay opened a new issue #12553:
URL: https://github.com/apache/arrow/issues/12553


   Hi,
   I have read through the docs and issues as best I can, and I am under the impression that it's not possible to run compute functions on nested arrays.
   
   I modified a group_by & aggregate example like so, putting the pa.array values into nested lists.
   ```
   import pyarrow as pa

   # Same as the docs example, except each value is wrapped in a list
   t = pa.table([
         pa.array(["a", "a", "b", "b", "c"]),
         pa.array([[1], [2], [3], [4], [5]]),
   ], names=["keys", "values"])
   
   t.group_by("keys").aggregate([("values", "sum")])
   ```
   
   The error is this:
   ```
   ArrowNotImplementedError: Function 'hash_sum' has no kernel matching input types (array[list<item: int64>], array[uint32])
   ```
   
   I assume this means the function doesn't know how to operate on a list. Is there a way to do this? I have large tensors that I can reshape into one dimension to store in a Record Batch, but I don't know how to perform computations on their values. The other option seems to be the Tensor type, but as far as I can tell it can't be used in a Record Batch or with compute functions, can it?
   
   PyArrow's zero-copy conversion from NumPy makes this an effective way to get data across the network using the IPC writer, and it's fairly easy to add other record types for custom metadata, but it would be a pity to then have to send this data back to NumPy for all my computations and lose out on all that great SIMD parallelization.
   
   Is there a better way?
   
   Related links:
   https://github.com/apache/arrow/issues/4802
   https://issues.apache.org/jira/browse/ARROW-1614

