You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by "Uwe L. Korn" <ma...@uwekorn.com> on 2020/04/07 11:00:55 UTC

[C++] Compute: Datum and "ChunkedArray&" inputs

Hello all,

I'm in the progress of changing the implementation of the Take kernel to work on ChunkedArrays without concatenating them into a single Array first. While working on the implementation, I realised that we switch often between Datum and the specific-typed parameters. This works quite well for the combination of Array& and Datum(shared_ptr<ArrayData>) as here the reference object with type Array& always carries a shared reference with it, so switching between Array& and its Datum is quite easy.

In contrast, we cannot do this with ChunkedArrays as here the Datum requires a shared_ptr<ChunkedArray> which cannot be constructed from the reference type. Thus to allow interfaces like `Status Take(FunctionContext* ctx, const ChunkedArray& values, const Array& indices,` to pass successfully their arguments to the Kernel implementation, we have to do:

a) Remove the references from the interface of the Take() function and use `shared_ptr` instances everywhere.
b) Add interfaces to kernels like the TakeKernel that allow calling with specific references instead of Datum instances

Personally I would prefer b) as this allow us to make more use of the C++ type system and would also avoid the shared_ptr overhead where not necessary.

Cheers,
Uwe

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Posted by Wes McKinney <we...@gmail.com>.
On Thu, Apr 9, 2020, 5:25 AM Antoine Pitrou <so...@pitrou.net> wrote:

>
> It seems there are two different concerns here:
> - the kernels' public API
> - the ease with which kernels can be implemented.
>
> If we need a different public API, then IMO we should change it sooner
> rather than later.
>

Yes, based on the problems I've listed I think some significant time needs
to be invested in this after 0.17.0 is released. I can allocate some of my
own time to it.


As for the implementation, perhaps we should start by drawing
> the main categories of kernels.  For example:
> - item-wise kernels (and/or vector-wise)
> - aggregate kernels
>
> Some kernels will not fall in these categories (e.g. filter, join...),
> but many of them do.
>

Yes, it's certainly important to distinguish been "operators" (which are
data flow processing nodes like joins, aggregations, filters, projections)
and functions / kernels (which perform granular units of work on concrete
batches of data). Some operators can be used in "one-shot" mode (eg
obtaining a result from a static unit of data) though.


Regards
>
> Antoine.
>
>
> On Wed, 8 Apr 2020 16:04:30 -0500
> Wes McKinney <we...@gmail.com> wrote:
> > On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > Another idea would be to have a variant with const-references instead
> > > of shared_ptr. One potential issue with our Datum is that it plays the
> > > dual role of transporting both input and output arguments. With
> > > outputs it's necessary to be able to convey ownership while with
> > > inputs this is less important.
> > >
> > > In general, I would say that some additional thought is needed about
> > > our kernel APIs. Some scattered thoughts about this:
> > >
> > > * Kernel "binding" (selecting a kernel given input types) is a bit ad
> > > hoc. Many kernels do not have the ability to tell you up front what
> > > type of data will be returned (this is very important in the context
> > > of a query engine runtime)
> >
> > Another observation is that many of our kernels are unable to write
> > into preallocated memory. This is also a design issue that must be
> > addressed (kernels having the same output type invoked in succession
> > should be able to elide temporary allocations by writing into the same
> > output memory)
> >
> > > * It's unclear that having a large collection of
> > > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable.
> > >
> > > Rather, it might be better to have something like:
> > >
> > > // We know the "schema" of each argument but don't have any data yet
> > > std::string kernel_name = "$name";
> > > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape},
> options);
> > > auto out = bound_kernel->Call({arg0, arg1});
> > >
> > > So here it would be
> > >
> > > ARROW_ASSIGN_OR_RAISE(auto take_kernel,
> > >                       GetKernel("take", {shapes::chunked_array(int8()),
> > >
> shapes::chunked_array(int8())}));
> > >
> > > So now take_kernel->shape() should tell you that the expected output
> > > is shapes::chunked_array(int8())
> > >
> > > And Call should preferably accept reference arguments for its input
> parameters.
> > >
> > > As far as kernel-specific options, we could create a variant that
> > > includes the different kernel-specific options structures, so that
> > > it's not necessary to have different dispatch functions for different
> > > kernels.
> > >
> > > Anyway, just some out loud thoughts, but while we are in this
> > > intermediate experimental state I'm supportive of you making the
> > > decision that is most convenient for the immediate problem you want to
> > > solve with the caveat that things may be refactored significantly in
> > > the proximate future.
> > >
> > > - Wes
> > >
> > > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn <uw...@xhochy.com>
> wrote:
> > > >
> > > > I did a bit more research on JIRA and we seem to have this open
> topic there also in https://issues.apache.org/jira/browse/ARROW-6959
> which is the similar topic as my mail is about and in
> https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some
> of the interfaces with reference-types.
> > > >
> > > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:
> > > > > Hello all,
> > > > >
> > > > > I'm in the progress of changing the implementation of the Take
> kernel
> > > > > to work on ChunkedArrays without concatenating them into a single
> Array
> > > > > first. While working on the implementation, I realised that we
> switch
> > > > > often between Datum and the specific-typed parameters. This works
> quite
> > > > > well for the combination of Array& and
> Datum(shared_ptr<ArrayData>) as
> > > > > here the reference object with type Array& always carries a shared
> > > > > reference with it, so switching between Array& and its Datum is
> quite
> > > > > easy.
> > > > >
> > > > > In contrast, we cannot do this with ChunkedArrays as here the Datum
> > > > > requires a shared_ptr<ChunkedArray> which cannot be constructed
> from
> > > > > the reference type. Thus to allow interfaces like `Status
> > > > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array&
> > > > > indices,` to pass successfully their arguments to the Kernel
> > > > > implementation, we have to do:
> > > > >
> > > > > a) Remove the references from the interface of the Take() function
> and
> > > > > use `shared_ptr` instances everywhere.
> > > > > b) Add interfaces to kernels like the TakeKernel that allow calling
> > > > > with specific references instead of Datum instances
> > > > >
> > > > > Personally I would prefer b) as this allow us to make more use of
> the
> > > > > C++ type system and would also avoid the shared_ptr overhead where
> not
> > > > > necessary.
> > > > >
> > > > > Cheers,
> > > > > Uwe
> > > > >
> >
>
>
>
>

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Posted by Antoine Pitrou <so...@pitrou.net>.
It seems there are two different concerns here:
- the kernels' public API
- the ease with which kernels can be implemented.

If we need a different public API, then IMO we should change it sooner
rather than later.

As for the implementation, perhaps we should start by drawing
the main categories of kernels.  For example:
- item-wise kernels (and/or vector-wise)
- aggregate kernels

Some kernels will not fall in these categories (e.g. filter, join...),
but many of them do.

Regards

Antoine.


On Wed, 8 Apr 2020 16:04:30 -0500
Wes McKinney <we...@gmail.com> wrote:
> On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > Another idea would be to have a variant with const-references instead
> > of shared_ptr. One potential issue with our Datum is that it plays the
> > dual role of transporting both input and output arguments. With
> > outputs it's necessary to be able to convey ownership while with
> > inputs this is less important.
> >
> > In general, I would say that some additional thought is needed about
> > our kernel APIs. Some scattered thoughts about this:
> >
> > * Kernel "binding" (selecting a kernel given input types) is a bit ad
> > hoc. Many kernels do not have the ability to tell you up front what
> > type of data will be returned (this is very important in the context
> > of a query engine runtime)  
> 
> Another observation is that many of our kernels are unable to write
> into preallocated memory. This is also a design issue that must be
> addressed (kernels having the same output type invoked in succession
> should be able to elide temporary allocations by writing into the same
> output memory)
> 
> > * It's unclear that having a large collection of
> > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable.
> >
> > Rather, it might be better to have something like:
> >
> > // We know the "schema" of each argument but don't have any data yet
> > std::string kernel_name = "$name";
> > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, options);
> > auto out = bound_kernel->Call({arg0, arg1});
> >
> > So here it would be
> >
> > ARROW_ASSIGN_OR_RAISE(auto take_kernel,
> >                       GetKernel("take", {shapes::chunked_array(int8()),
> >                                          shapes::chunked_array(int8())}));
> >
> > So now take_kernel->shape() should tell you that the expected output
> > is shapes::chunked_array(int8())
> >
> > And Call should preferably accept reference arguments for its input parameters.
> >
> > As far as kernel-specific options, we could create a variant that
> > includes the different kernel-specific options structures, so that
> > it's not necessary to have different dispatch functions for different
> > kernels.
> >
> > Anyway, just some out loud thoughts, but while we are in this
> > intermediate experimental state I'm supportive of you making the
> > decision that is most convenient for the immediate problem you want to
> > solve with the caveat that things may be refactored significantly in
> > the proximate future.
> >
> > - Wes
> >
> > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn <uw...@xhochy.com> wrote:  
> > >
> > > I did a bit more research on JIRA and we seem to have this open topic there also in https://issues.apache.org/jira/browse/ARROW-6959 which is the similar topic as my mail is about and in https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some of the interfaces with reference-types.
> > >
> > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:  
> > > > Hello all,
> > > >
> > > > I'm in the progress of changing the implementation of the Take kernel
> > > > to work on ChunkedArrays without concatenating them into a single Array
> > > > first. While working on the implementation, I realised that we switch
> > > > often between Datum and the specific-typed parameters. This works quite
> > > > well for the combination of Array& and Datum(shared_ptr<ArrayData>) as
> > > > here the reference object with type Array& always carries a shared
> > > > reference with it, so switching between Array& and its Datum is quite
> > > > easy.
> > > >
> > > > In contrast, we cannot do this with ChunkedArrays as here the Datum
> > > > requires a shared_ptr<ChunkedArray> which cannot be constructed from
> > > > the reference type. Thus to allow interfaces like `Status
> > > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array&
> > > > indices,` to pass successfully their arguments to the Kernel
> > > > implementation, we have to do:
> > > >
> > > > a) Remove the references from the interface of the Take() function and
> > > > use `shared_ptr` instances everywhere.
> > > > b) Add interfaces to kernels like the TakeKernel that allow calling
> > > > with specific references instead of Datum instances
> > > >
> > > > Personally I would prefer b) as this allow us to make more use of the
> > > > C++ type system and would also avoid the shared_ptr overhead where not
> > > > necessary.
> > > >
> > > > Cheers,
> > > > Uwe
> > > >  
> 




Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney <we...@gmail.com> wrote:
>
> Another idea would be to have a variant with const-references instead
> of shared_ptr. One potential issue with our Datum is that it plays the
> dual role of transporting both input and output arguments. With
> outputs it's necessary to be able to convey ownership while with
> inputs this is less important.
>
> In general, I would say that some additional thought is needed about
> our kernel APIs. Some scattered thoughts about this:
>
> * Kernel "binding" (selecting a kernel given input types) is a bit ad
> hoc. Many kernels do not have the ability to tell you up front what
> type of data will be returned (this is very important in the context
> of a query engine runtime)

Another observation is that many of our kernels are unable to write
into preallocated memory. This is also a design issue that must be
addressed (kernels having the same output type invoked in succession
should be able to elide temporary allocations by writing into the same
output memory)

> * It's unclear that having a large collection of
> $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable.
>
> Rather, it might be better to have something like:
>
> // We know the "schema" of each argument but don't have any data yet
> std::string kernel_name = "$name";
> auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, options);
> auto out = bound_kernel->Call({arg0, arg1});
>
> So here it would be
>
> ARROW_ASSIGN_OR_RAISE(auto take_kernel,
>                       GetKernel("take", {shapes::chunked_array(int8()),
>                                          shapes::chunked_array(int8())}));
>
> So now take_kernel->shape() should tell you that the expected output
> is shapes::chunked_array(int8())
>
> And Call should preferably accept reference arguments for its input parameters.
>
> As far as kernel-specific options, we could create a variant that
> includes the different kernel-specific options structures, so that
> it's not necessary to have different dispatch functions for different
> kernels.
>
> Anyway, just some out loud thoughts, but while we are in this
> intermediate experimental state I'm supportive of you making the
> decision that is most convenient for the immediate problem you want to
> solve with the caveat that things may be refactored significantly in
> the proximate future.
>
> - Wes
>
> On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > I did a bit more research on JIRA and we seem to have this open topic there also in https://issues.apache.org/jira/browse/ARROW-6959 which is the similar topic as my mail is about and in https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some of the interfaces with reference-types.
> >
> > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:
> > > Hello all,
> > >
> > > I'm in the progress of changing the implementation of the Take kernel
> > > to work on ChunkedArrays without concatenating them into a single Array
> > > first. While working on the implementation, I realised that we switch
> > > often between Datum and the specific-typed parameters. This works quite
> > > well for the combination of Array& and Datum(shared_ptr<ArrayData>) as
> > > here the reference object with type Array& always carries a shared
> > > reference with it, so switching between Array& and its Datum is quite
> > > easy.
> > >
> > > In contrast, we cannot do this with ChunkedArrays as here the Datum
> > > requires a shared_ptr<ChunkedArray> which cannot be constructed from
> > > the reference type. Thus to allow interfaces like `Status
> > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array&
> > > indices,` to pass successfully their arguments to the Kernel
> > > implementation, we have to do:
> > >
> > > a) Remove the references from the interface of the Take() function and
> > > use `shared_ptr` instances everywhere.
> > > b) Add interfaces to kernels like the TakeKernel that allow calling
> > > with specific references instead of Datum instances
> > >
> > > Personally I would prefer b) as this allow us to make more use of the
> > > C++ type system and would also avoid the shared_ptr overhead where not
> > > necessary.
> > >
> > > Cheers,
> > > Uwe
> > >

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Posted by Wes McKinney <we...@gmail.com>.
Another idea would be to have a variant with const-references instead
of shared_ptr. One potential issue with our Datum is that it plays the
dual role of transporting both input and output arguments. With
outputs it's necessary to be able to convey ownership while with
inputs this is less important.

In general, I would say that some additional thought is needed about
our kernel APIs. Some scattered thoughts about this:

* Kernel "binding" (selecting a kernel given input types) is a bit ad
hoc. Many kernels do not have the ability to tell you up front what
type of data will be returned (this is very important in the context
of a query engine runtime)
* It's unclear that having a large collection of
$FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable.

Rather, it might be better to have something like:

// We know the "schema" of each argument but don't have any data yet
std::string kernel_name = "$name";
auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, options);
auto out = bound_kernel->Call({arg0, arg1});

So here it would be

ARROW_ASSIGN_OR_RAISE(auto take_kernel,
                      GetKernel("take", {shapes::chunked_array(int8()),
                                         shapes::chunked_array(int8())}));

So now take_kernel->shape() should tell you that the expected output
is shapes::chunked_array(int8())

And Call should preferably accept reference arguments for its input parameters.

As far as kernel-specific options, we could create a variant that
includes the different kernel-specific options structures, so that
it's not necessary to have different dispatch functions for different
kernels.

Anyway, just some out loud thoughts, but while we are in this
intermediate experimental state I'm supportive of you making the
decision that is most convenient for the immediate problem you want to
solve with the caveat that things may be refactored significantly in
the proximate future.

- Wes

On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> I did a bit more research on JIRA and we seem to have this open topic there also in https://issues.apache.org/jira/browse/ARROW-6959 which is the similar topic as my mail is about and in https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some of the interfaces with reference-types.
>
> On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:
> > Hello all,
> >
> > I'm in the progress of changing the implementation of the Take kernel
> > to work on ChunkedArrays without concatenating them into a single Array
> > first. While working on the implementation, I realised that we switch
> > often between Datum and the specific-typed parameters. This works quite
> > well for the combination of Array& and Datum(shared_ptr<ArrayData>) as
> > here the reference object with type Array& always carries a shared
> > reference with it, so switching between Array& and its Datum is quite
> > easy.
> >
> > In contrast, we cannot do this with ChunkedArrays as here the Datum
> > requires a shared_ptr<ChunkedArray> which cannot be constructed from
> > the reference type. Thus to allow interfaces like `Status
> > Take(FunctionContext* ctx, const ChunkedArray& values, const Array&
> > indices,` to pass successfully their arguments to the Kernel
> > implementation, we have to do:
> >
> > a) Remove the references from the interface of the Take() function and
> > use `shared_ptr` instances everywhere.
> > b) Add interfaces to kernels like the TakeKernel that allow calling
> > with specific references instead of Datum instances
> >
> > Personally I would prefer b) as this allow us to make more use of the
> > C++ type system and would also avoid the shared_ptr overhead where not
> > necessary.
> >
> > Cheers,
> > Uwe
> >

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
I did a bit more research on JIRA and we seem to have this open topic there also in https://issues.apache.org/jira/browse/ARROW-6959 which is the similar topic as my mail is about and in https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some of the interfaces with reference-types.

On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:
> Hello all,
> 
> I'm in the progress of changing the implementation of the Take kernel 
> to work on ChunkedArrays without concatenating them into a single Array 
> first. While working on the implementation, I realised that we switch 
> often between Datum and the specific-typed parameters. This works quite 
> well for the combination of Array& and Datum(shared_ptr<ArrayData>) as 
> here the reference object with type Array& always carries a shared 
> reference with it, so switching between Array& and its Datum is quite 
> easy.
> 
> In contrast, we cannot do this with ChunkedArrays as here the Datum 
> requires a shared_ptr<ChunkedArray> which cannot be constructed from 
> the reference type. Thus to allow interfaces like `Status 
> Take(FunctionContext* ctx, const ChunkedArray& values, const Array& 
> indices,` to pass successfully their arguments to the Kernel 
> implementation, we have to do:
> 
> a) Remove the references from the interface of the Take() function and 
> use `shared_ptr` instances everywhere.
> b) Add interfaces to kernels like the TakeKernel that allow calling 
> with specific references instead of Datum instances
> 
> Personally I would prefer b) as this allow us to make more use of the 
> C++ type system and would also avoid the shared_ptr overhead where not 
> necessary.
> 
> Cheers,
> Uwe
>