Posted to user@arrow.apache.org by Spencer Nelson <sw...@uw.edu> on 2023/05/18 17:07:46 UTC

python scalar aggregations on struct arrays

I have a struct array with a few fields. I'd like to compute scalar
aggregations over several of its fields (like computing the min and max of
each field) in a single pass. As a simple case, something like this:

import pyarrow as pa
import pyarrow.compute as pc

struct_type = pa.struct([("x", pa.float64()), ("y", pa.float64())])
array = pa.array([
    {"x": 1, "y": 2},
    {"x": 3, "y": 4},
    {"x": 5, "y": 6}
  ],
  struct_type)

I can compute the min_max of "x" and "y" individually:

>>> pc.min_max(pc.struct_field(array, 0))
<pyarrow.StructScalar: [('min', 1.0), ('max', 5.0)]>

>>> pc.min_max(pc.struct_field(array, 1))
<pyarrow.StructScalar: [('min', 2.0), ('max', 6.0)]>

But what I'd really like is some way to apply min_max to the x and y
columns in one go, resulting in something like

<pyarrow.StructScalar: [('x', {'min': 1.0, 'max': 5.0}), ('y', {'min': 2.0,
'max': 6.0})]>

Is this possible from pyarrow?

Re: python scalar aggregations on struct arrays

Posted by Micah Kornfield <em...@gmail.com>.
Hi Spencer,
I'm not aware of a helper method that would do this (but I don't have a lot
of expertise in this area of the code).  From a computational perspective,
writing a small helper function in Python that loops over the fields does
not really lose efficiency, because of Arrow's columnar layout (all the
heavy lifting is still pushed down to C++).
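For example, a minimal sketch along those lines (the helper name
min_max_per_field is just illustrative, not an existing pyarrow API):

import pyarrow as pa
import pyarrow.compute as pc

def min_max_per_field(struct_array):
    # Loop over the struct's fields in Python; the aggregation over each
    # child array still happens in C++ via pc.min_max.
    results = {}
    for i in range(struct_array.type.num_fields):
        name = struct_array.type.field(i).name
        results[name] = pc.min_max(pc.struct_field(struct_array, i))
    return results

>>> min_max_per_field(array)
{'x': <pyarrow.StructScalar: [('min', 1.0), ('max', 5.0)]>,
 'y': <pyarrow.StructScalar: [('min', 2.0), ('max', 6.0)]>}

If you want the single nested StructScalar from your example, you should be
able to convert each per-field result with .as_py() and pass the resulting
dict to pa.scalar(), though that final wrapping step happens in Python
rather than C++.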

Thanks,
Micah
