Posted to user@arrow.apache.org by Weston Pace <we...@gmail.com> on 2022/06/01 00:24:55 UTC

Re: [Python] Pyarrow Computation inplace?

I'd be more interested in some kind of buffer / array pool plus the
ability to specify an output buffer for a kernel function.  I think it
would achieve the same goal (avoiding allocation) with more
flexibility (e.g. you wouldn't have to overwrite your input buffer).

At the moment, though, I wonder if this is a concern.  Jemalloc should
do some level of memory reuse.  Is there a specific performance issue
you are encountering?

On Tue, May 31, 2022 at 11:45 AM Wes McKinney <we...@gmail.com> wrote:
>
> *In principle*, it would be possible to provide mutable output buffers
> for a kernel's execution, so that input and output buffers could be
> the same (essentially exposing the lower-level kernel execution
> interface that underlies arrow::compute::CallFunction). But this would
> be a fair amount of development work to achieve. If there are others
> interested in exploring an implementation, we could create a Jira
> issue.
>
> On Sun, May 29, 2022 at 3:04 PM Micah Kornfield <em...@gmail.com> wrote:
> >
> > I think even in Cython this might be difficult, as Array data structures are generally considered immutable, so this is inherently unsafe and requires great care.
> >
> > On Sun, May 29, 2022 at 11:21 AM Cedric Yau <ce...@gmail.com> wrote:
> >>
> >> Suppose I have an array with 1MM integers and I add 1 to them with pyarrow.compute.add.  It looks like a new array is assigned.
> >>
> >> Is there a way to do this in place?  It looks like a new array is allocated.  Would Cython be required at this point?
> >>
> >> ```
> >> import pyarrow as pa
> >> import pyarrow.compute as pc
> >>
> >> a = pa.array(range(1000000))
> >> print(id(a))
> >> a = pc.add(a,1)
> >> print(id(a))
> >>
> >> # output
> >> # 139634974909024
> >> # 139633492705920
> >> ```
> >>
> >> Thanks,
> >> Cedric

Re: [C++] Managing max memory usage in Arrow

Posted by HK Verma <hk...@gmail.com>.
Thanks I will try this out.

On Tue, May 31, 2022 at 9:16 PM Micah Kornfield <em...@gmail.com>
wrote:

> It should be possible to create a custom memory pool for this purpose and
> pass it through most APIs.
>
> On Tue, May 31, 2022 at 5:36 PM HK Verma <hk...@gmail.com> wrote:
>
>> We have an application where only a limited amount of memory is available
>> for an Arrow-based Linux service. I can put resource constraints on the
>> service when starting it in Linux
>> <https://www.freedesktop.org/software/systemd/man/systemd.exec.html#LimitCPU=>.
>> However, I am wondering if it is possible, or recommended, to create a
>> custom MemoryPool <https://arrow.apache.org/docs/cpp/memory.html>
>> limiting memory use.
>> Thanks,
>> HK
>>
>

Re: [C++] Managing max memory usage in Arrow

Posted by Micah Kornfield <em...@gmail.com>.
It should be possible to create a custom memory pool for this purpose and
pass it through most APIs.
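
[Editor's sketch, not from the thread: from Python, the closest readily available tool is `pa.proxy_memory_pool`, which layers statistics on top of a parent pool; most pyarrow APIs accept a `memory_pool` argument. Note this only *tracks* usage — enforcing a hard cap would mean subclassing `arrow::MemoryPool` in C++.]

```python
import pyarrow as pa
import pyarrow.compute as pc

# A proxy pool forwards allocations to its parent while keeping its own
# statistics; it tracks usage but does not enforce any limit.
pool = pa.proxy_memory_pool(pa.default_memory_pool())

a = pa.array(range(100_000), memory_pool=pool)
b = pc.add(a, 1, memory_pool=pool)

print("currently allocated:", pool.bytes_allocated())
print("high-water mark:", pool.max_memory())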

On Tue, May 31, 2022 at 5:36 PM HK Verma <hk...@gmail.com> wrote:

> We have an application where only a limited amount of memory is available
> for an Arrow-based Linux service. I can put resource constraints on the
> service when starting it in Linux
> <https://www.freedesktop.org/software/systemd/man/systemd.exec.html#LimitCPU=>.
> However, I am wondering if it is possible, or recommended, to create a
> custom MemoryPool <https://arrow.apache.org/docs/cpp/memory.html>
> limiting memory use.
> Thanks,
> HK
>

[C++] Managing max memory usage in Arrow

Posted by HK Verma <hk...@gmail.com>.
We have an application where only a limited amount of memory is available
for an Arrow-based Linux service. I can put resource constraints on the
service when starting it in Linux
<https://www.freedesktop.org/software/systemd/man/systemd.exec.html#LimitCPU=>.
However, I am wondering if it is possible, or recommended, to create a
custom MemoryPool <https://arrow.apache.org/docs/cpp/memory.html> limiting
memory use.
Thanks,
HK

Re: [Python] Pyarrow Computation inplace?

Posted by Cedric Yau <ce...@gmail.com>.
Thanks, Wes.  Makes sense that compute expressions are the direction to
go.

I finally found the use of expressions when searching for "Projection":
https://arrow.apache.org/docs/python/dataset.html

And the result is that running the following with and without output_file
shows a difference of about 180 ms, compared to about 2 seconds before, so
this is definitely 10x faster!  (The source Parquet file has 33MM records.)

The key learning here is that I need to read *and* project simultaneously,
not read *then* project.

projection = {
    'username': ds.field('username'),
    'output_file':
        pc.if_else(pc.is_null(ds.field('username')), pc.scalar(9),
        pc.if_else(ds.field('username') < 'Bil', pc.scalar(0),
        pc.if_else(ds.field('username') < 'Cou', pc.scalar(1),
        pc.if_else(ds.field('username') < 'Eve', pc.scalar(2),
        pc.if_else(ds.field('username') < 'IZh', pc.scalar(3),
        pc.if_else(ds.field('username') < 'Kib', pc.scalar(4),
        pc.if_else(ds.field('username') < 'Mat', pc.scalar(5),
        pc.if_else(ds.field('username') < 'Pat', pc.scalar(6),
        pc.if_else(ds.field('username') < 'Sco', pc.scalar(7),
        pc.if_else(ds.field('username') < 'Tok', pc.scalar(8),
                   pc.scalar(9)))))))))))
}
t = pq.read_table('/path/to/file', columns=projection)

Best,
Cedric

On Wed, Jun 1, 2022 at 9:04 AM Wes McKinney <we...@gmail.com> wrote:

> > we may need to consider C++ implementations to get full benefit.
>
> The way that you are executing your expression is not the way that we
> intend users to write performant code.
>
> The intended way is to build an expression and then execute the
> expression — the current expression evaluator is not very efficient
> (it does not reuse temporary allocations yet), but that is where the
> optimization work should happen rather than in custom C++ code.
>
> I'm struggling to find documentation for this right now (expressions
> aren't discussed in https://arrow.apache.org/docs/python/compute.html)
> but there are many examples in the unit tests:
>
>
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py
>
> On Tue, May 31, 2022 at 11:25 PM Cedric Yau <ce...@gmail.com> wrote:
> >
> > I have code like below to range-partition a file.  It looks like each of
> the pc.less, pc.cast, and pc.add calls allocates a new array.  So the code
> appears to be spending more time performing memory allocations than it is
> doing the comparisons.  The performance is still pretty good (and faster
> than the alternatives), but it does make me think that as we start shifting
> more calculations to Arrow, we may need to consider C++ implementations to
> get full benefit.
> >
> > Thanks,
> > Cedric
> >
> > import pyarrow.parquet as pq
> > import pyarrow.compute as pc
> >
> > t = pq.read_table('/path/to/file', columns=['username'])
> > ta = t.column('username')
> >
> > output_file = pc.cast( pc.less(ta,'Bil'), 'int8')
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Cou'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Eve'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Ish'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Kib'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Mat'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Pat'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Sco'), 'int8'))
> > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Tok'), 'int8'))
> > output_file = pc.coalesce(output_file,9)
> >
> > On Tue, May 31, 2022 at 2:25 PM Weston Pace <we...@gmail.com>
> wrote:
> >>
> >> I'd be more interested in some kind of buffer / array pool plus the
> >> ability to specify an output buffer for a kernel function.  I think it
> >> would achieve the same goal (avoiding allocation) with more
> >> flexibility (e.g. you wouldn't have to overwrite your input buffer).
> >>
> >> At the moment though I wonder if this is a concern.  Jemalloc should
> >> do some level of memory reuse.  Is there a specific performance issue
> >> you are encountering?
> >>
> >> On Tue, May 31, 2022 at 11:45 AM Wes McKinney <we...@gmail.com>
> wrote:
> >> >
> >> > *In principle*, it would be possible to provide mutable output buffers
> >> > for a kernel's execution, so that input and output buffers could be
> >> > the same (essentially exposing the lower-level kernel execution
> >> > interface that underlies arrow::compute::CallFunction). But this would
> >> > be a fair amount of development work to achieve. If there are others
> >> > interested in exploring an implementation, we could create a Jira
> >> > issue.
> >> >
> >> > On Sun, May 29, 2022 at 3:04 PM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> >> > >
> >> > > I think even in Cython this might be difficult, as Array data
> structures are generally considered immutable, so this is inherently
> unsafe and requires great care.
> >> > >
> >> > > On Sun, May 29, 2022 at 11:21 AM Cedric Yau <ce...@gmail.com>
> wrote:
> >> > >>
> >> > >> Suppose I have an array with 1MM integers and I add 1 to them with
> pyarrow.compute.add.  It looks like a new array is assigned.
> >> > >>
> >> > >> Is there a way to do this in place?  It looks like a new array is
> allocated.  Would Cython be required at this point?
> >> > >>
> >> > >> ```
> >> > >> import pyarrow as pa
> >> > >> import pyarrow.compute as pc
> >> > >>
> >> > >> a = pa.array(range(1000000))
> >> > >> print(id(a))
> >> > >> a = pc.add(a,1)
> >> > >> print(id(a))
> >> > >>
> >> > >> # output
> >> > >> # 139634974909024
> >> > >> # 139633492705920
> >> > >> ```
> >> > >>
> >> > >> Thanks,
> >> > >> Cedric
> >
> >
> >
>

Re: [Python] Pyarrow Computation inplace?

Posted by Wes McKinney <we...@gmail.com>.
> we may need to consider C++ implementations to get full benefit.

The way that you are executing your expression is not the way that we
intend users to write performant code.

The intended way is to build an expression and then execute the
expression — the current expression evaluator is not very efficient
(it does not reuse temporary allocations yet), but that is where the
optimization work should happen rather than in custom C++ code.

I'm struggling to find documentation for this right now (expressions
aren't discussed in https://arrow.apache.org/docs/python/compute.html)
but there are many examples in the unit tests:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py

On Tue, May 31, 2022 at 11:25 PM Cedric Yau <ce...@gmail.com> wrote:
>
> I have code like below to range-partition a file.  It looks like each of the pc.less, pc.cast, and pc.add calls allocates a new array.  So the code appears to be spending more time performing memory allocations than it is doing the comparisons.  The performance is still pretty good (and faster than the alternatives), but it does make me think that as we start shifting more calculations to Arrow, we may need to consider C++ implementations to get full benefit.
>
> Thanks,
> Cedric
>
> import pyarrow.parquet as pq
> import pyarrow.compute as pc
>
> t = pq.read_table('/path/to/file', columns=['username'])
> ta = t.column('username')
>
> output_file = pc.cast( pc.less(ta,'Bil'), 'int8')
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Cou'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Eve'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Ish'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Kib'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Mat'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Pat'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Sco'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Tok'), 'int8'))
> output_file = pc.coalesce(output_file,9)
>
> On Tue, May 31, 2022 at 2:25 PM Weston Pace <we...@gmail.com> wrote:
>>
>> I'd be more interested in some kind of buffer / array pool plus the
>> ability to specify an output buffer for a kernel function.  I think it
>> would achieve the same goal (avoiding allocation) with more
>> flexibility (e.g. you wouldn't have to overwrite your input buffer).
>>
>> At the moment though I wonder if this is a concern.  Jemalloc should
>> do some level of memory reuse.  Is there a specific performance issue
>> you are encountering?
>>
>> On Tue, May 31, 2022 at 11:45 AM Wes McKinney <we...@gmail.com> wrote:
>> >
>> > *In principle*, it would be possible to provide mutable output buffers
>> > for a kernel's execution, so that input and output buffers could be
>> > the same (essentially exposing the lower-level kernel execution
>> > interface that underlies arrow::compute::CallFunction). But this would
>> > be a fair amount of development work to achieve. If there are others
>> > interested in exploring an implementation, we could create a Jira
>> > issue.
>> >
>> > On Sun, May 29, 2022 at 3:04 PM Micah Kornfield <em...@gmail.com> wrote:
>> > >
>> > > I think even in Cython this might be difficult, as Array data structures are generally considered immutable, so this is inherently unsafe and requires great care.
>> > >
>> > > On Sun, May 29, 2022 at 11:21 AM Cedric Yau <ce...@gmail.com> wrote:
>> > >>
>> > >> Suppose I have an array with 1MM integers and I add 1 to them with pyarrow.compute.add.  It looks like a new array is assigned.
>> > >>
>> > >> Is there a way to do this in place?  It looks like a new array is allocated.  Would Cython be required at this point?
>> > >>
>> > >> ```
>> > >> import pyarrow as pa
>> > >> import pyarrow.compute as pc
>> > >>
>> > >> a = pa.array(range(1000000))
>> > >> print(id(a))
>> > >> a = pc.add(a,1)
>> > >> print(id(a))
>> > >>
>> > >> # output
>> > >> # 139634974909024
>> > >> # 139633492705920
>> > >> ```
>> > >>
>> > >> Thanks,
>> > >> Cedric
>
>
>

Re: [Python] Pyarrow Computation inplace?

Posted by Cedric Yau <ce...@gmail.com>.
I have code like below to range-partition a file.  It looks like each of
the pc.less, pc.cast, and pc.add calls allocates a new array.  So the code
appears to be spending more time performing memory allocations than it is
doing the comparisons.  The performance is still pretty good (and faster
than the alternatives), but it does make me think that as we start shifting
more calculations to Arrow, we may need to consider C++ implementations to
get full benefit.

Thanks,
Cedric

import pyarrow.parquet as pq
import pyarrow.compute as pc

t = pq.read_table('/path/to/file', columns=['username'])
ta = t.column('username')

output_file = pc.cast(pc.less(ta, 'Bil'), 'int8')
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Cou'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Eve'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Ish'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Kib'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Mat'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Pat'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Sco'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Tok'), 'int8'))
output_file = pc.coalesce(output_file, 9)

On Tue, May 31, 2022 at 2:25 PM Weston Pace <we...@gmail.com> wrote:

> I'd be more interested in some kind of buffer / array pool plus the
> ability to specify an output buffer for a kernel function.  I think it
> would achieve the same goal (avoiding allocation) with more
> flexibility (e.g. you wouldn't have to overwrite your input buffer).
>
> At the moment though I wonder if this is a concern.  Jemalloc should
> do some level of memory reuse.  Is there a specific performance issue
> you are encountering?
>
> On Tue, May 31, 2022 at 11:45 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > *In principle*, it would be possible to provide mutable output buffers
> > for a kernel's execution, so that input and output buffers could be
> > the same (essentially exposing the lower-level kernel execution
> > interface that underlies arrow::compute::CallFunction). But this would
> > be a fair amount of development work to achieve. If there are others
> > interested in exploring an implementation, we could create a Jira
> > issue.
> >
> > On Sun, May 29, 2022 at 3:04 PM Micah Kornfield <em...@gmail.com>
> wrote:
> > >
> > > I think even in Cython this might be difficult, as Array data
> structures are generally considered immutable, so this is inherently
> unsafe and requires great care.
> > >
> > > On Sun, May 29, 2022 at 11:21 AM Cedric Yau <ce...@gmail.com>
> wrote:
> > >>
> > >> Suppose I have an array with 1MM integers and I add 1 to them with
> pyarrow.compute.add.  It looks like a new array is assigned.
> > >>
> > >> Is there a way to do this in place?  It looks like a new array is
> allocated.  Would Cython be required at this point?
> > >>
> > >> ```
> > >> import pyarrow as pa
> > >> import pyarrow.compute as pc
> > >>
> > >> a = pa.array(range(1000000))
> > >> print(id(a))
> > >> a = pc.add(a,1)
> > >> print(id(a))
> > >>
> > >> # output
> > >> # 139634974909024
> > >> # 139633492705920
> > >> ```
> > >>
> > >> Thanks,
> > >> Cedric
>