Posted to user@arrow.apache.org by Suresh V <su...@gmail.com> on 2021/04/07 17:41:14 UTC

[Python] Run multiple pc.compute functions on chunks in single pass

Hi .. I am trying to compute aggregates on a large dataset (100GB) stored
in Parquet format. My current approach is to use scan/fragment to load
chunks into memory iteratively, and I would like to run the equivalent of
the following on each chunk using pc.compute functions:

df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])

My understanding is that pc.compute has to scan the entire array once per
function. Please let me know if that is not the case and how to optimize
it.
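
For reference, here is a rough sketch of the chunked approach I have in
mind; the dataset path and the value column 'x' are placeholders, and
since pc.compute has no grouped aggregation, the sketch falls back to
pandas for the per-chunk groupby. Sum, count, min, and max are all
mergeable, so per-chunk partials can be combined at the end:

import pandas as pd
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")

# Aggregate each chunk independently; to_batches() streams record
# batches, so the full 100GB never has to be in memory at once.
partials = []
for batch in dataset.to_batches():
    df = batch.to_pandas()
    partials.append(df.groupby(['a', 'b', 'c'])['x']
                      .agg(['sum', 'count', 'min', 'max']))

# Merge the partials: sums and counts add, mins take the min, maxes
# take the max (groups may straddle chunks, so a merge is required).
merged = pd.concat(partials).groupby(level=[0, 1, 2]).agg(
    {'sum': 'sum', 'count': 'sum', 'min': 'min', 'max': 'max'})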

Thanks

Re: [Python] Run multiple pc.compute functions on chunks in single pass

Posted by Suresh V <su...@gmail.com>.
Thank you very much for the response @wesm. I am looking forward to the
changes, and hopefully to gaining enough knowledge to start contributing
to the project. I am planning to go the Cython route with a custom
aggregator for now. To be honest, I am not sure how much we gain from a
single pass versus the potential loss of CPU-friendly vectorization.
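
That said, some kernels already combine work in one scan; pc.min_max
returns both extremes in a single pass, so two of the four aggregates
can share a scan. A minimal, ungrouped sketch with toy data:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([3, 1, 4, 1, 5])
mm = pc.min_max(arr)   # one pass yields both: {'min': 1, 'max': 5}
total = pc.sum(arr)    # 14
n = pc.count(arr)      # 5 (non-null values)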

On Wed, Apr 7, 2021 at 1:53 PM Wes McKinney <we...@gmail.com> wrote:

> We are working on implementing a streaming aggregation to be available in
> Python but it probably won’t be available until the 5.0 release. I am not
> sure solving this problem efficiently is possible at 100GB scale with the
> tools currently available in pyarrow.
>
> On Wed, Apr 7, 2021 at 12:41 PM Suresh V <su...@gmail.com> wrote:
>
>> Hi .. I am trying to compute aggregates on a large dataset (100GB)
>> stored in Parquet format. My current approach is to use scan/fragment
>> to load chunks into memory iteratively, and I would like to run the
>> equivalent of the following on each chunk using pc.compute functions:
>>
>> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>>
>> My understanding is that pc.compute has to scan the entire array once
>> per function. Please let me know if that is not the case and how to
>> optimize it.
>>
>> Thanks
>>
>

Re: [Python] Run multiple pc.compute functions on chunks in single pass

Posted by Wes McKinney <we...@gmail.com>.
We are working on implementing a streaming aggregation to be available in
Python but it probably won’t be available until the 5.0 release. I am not
sure solving this problem efficiently is possible at 100GB scale with the
tools currently available in pyarrow.

On Wed, Apr 7, 2021 at 12:41 PM Suresh V <su...@gmail.com> wrote:

> Hi .. I am trying to compute aggregates on a large dataset (100GB)
> stored in Parquet format. My current approach is to use scan/fragment to
> load chunks into memory iteratively, and I would like to run the
> equivalent of the following on each chunk using pc.compute functions:
>
> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>
> My understanding is that pc.compute has to scan the entire array once
> per function. Please let me know if that is not the case and how to
> optimize it.
>
> Thanks
>