Posted to dev@arrow.apache.org by Ian Cook <ia...@ursacomputing.com> on 2021/06/17 22:58:44 UTC

[C++] Apache Arrow C++ Variadic Kernels Design

Arrow developers,

A couple of recent PRs have added new variadic scalar kernels to the
Arrow C++ library (ARROW-12751, ARROW-12709). There were some
questions raised in comments on Jira and GitHub about whether these
could instead be implemented as unary or binary kernels that take
ListArray or StructArray input. Since I believe we plan to add at
least a few more variadic kernels, I wrote a document [1] with help
from some colleagues at Ursa to describe the rationale behind why we
believe it is best to implement these as variadic kernels. Feedback is
welcome.

Thank you,
Ian

[1] https://docs.google.com/document/d/1ExysJ43WpjZ_P6vnfx6dzCRSvM-3qlqpc6gPjy9cNXM/

Re: [C++] Apache Arrow C++ Variadic Kernels Design

Posted by Wes McKinney <we...@gmail.com>.

Hi Ian, I agree with implementing these functions with
varargs/variadic inputs (this was my original intent when drafting
compute/kernel.h and related machinery last year).

One nuance with the way things work right now: the type matching
infrastructure cannot necessarily determine whether varargs inputs
are compatible with each other, because the type matching rule
considers each argument independently:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.cc#L411

There may be other places to address this type correspondence, but if
it were deemed useful to better support variadic argument validation,
we could probably fairly easily generalize the TypeMatcher API so that
it can "see" the types of the other arguments.
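
To make the distinction concrete, here is a minimal, self-contained
sketch. It does not use Arrow's actual TypeMatcher or InputType
classes; TypeName, PerArgMatcher, ContextMatcher and the helper
functions are all hypothetical names invented for illustration. It
only contrasts a rule that sees one argument type at a time with a
rule that can also see the types of the other arguments.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a data type; this is NOT Arrow's DataType.
using TypeName = std::string;

// Per-argument rule: it is handed exactly one type at a time.
using PerArgMatcher = std::function<bool(const TypeName&)>;

// Hypothetical generalized rule: it is handed the index of the argument
// being checked plus the types of all arguments.
using ContextMatcher =
    std::function<bool(std::size_t, const std::vector<TypeName>&)>;

// Applies a per-argument rule to every varargs input independently.
bool MatchesIndependently(const PerArgMatcher& matcher,
                          const std::vector<TypeName>& args) {
  for (const auto& type : args) {
    if (!matcher(type)) return false;
  }
  return true;
}

// Applies a context-aware rule to every varargs input.
bool MatchesWithContext(const ContextMatcher& matcher,
                        const std::vector<TypeName>& args) {
  for (std::size_t i = 0; i < args.size(); ++i) {
    if (!matcher(i, args)) return false;
  }
  return true;
}

// Example per-argument rule: accepts any numeric type.
bool IsNumeric(const TypeName& type) {
  return type == "int32" || type == "int64" || type == "float64";
}

// Example context-aware rule: numeric and identical to the first argument.
bool SameAsFirst(std::size_t i, const std::vector<TypeName>& all) {
  return IsNumeric(all[i]) && all[i] == all[0];
}
```

With these definitions, the inputs (int32, float64) pass
MatchesIndependently(IsNumeric, ...) but fail
MatchesWithContext(SameAsFirst, ...), which is the kind of mismatch a
per-argument rule cannot detect.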

Thanks,
Wes

Re: [C++] Apache Arrow C++ Variadic Kernels Design

Posted by Wes McKinney <we...@gmail.com>.

COUNT(DISTINCT varargs...) can be used either as a scalar aggregate
function or as a group aggregate function. For example:

SELECT COUNT(DISTINCT expr1, expr2, ...)
FROM TABLE;

returns a single value. It can also be used with GROUP BY to produce a
distinct count per group. I think it would be useful to have available
as a scalar aggregate function. Either way, it is good to know that our
aggregation exprs will need to support varargs.
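
As a rough illustration of the two usages, the following sketch counts
distinct rows across a variable number of columns, once over the whole
input and once per group key. It uses only the C++ standard library,
not Arrow; Column, Row, RowAt, CountDistinct and CountDistinctByGroup
are hypothetical names. It assumes all columns have equal length and
does not model nulls.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified column and row types (no nulls, int64 only).
using Column = std::vector<std::int64_t>;
using Row = std::vector<std::int64_t>;

// Gathers the i-th value of every column into one row.
// Assumes every column has more than i elements.
Row RowAt(const std::vector<Column>& columns, std::size_t i) {
  Row row;
  row.reserve(columns.size());
  for (const auto& column : columns) row.push_back(column[i]);
  return row;
}

// Scalar aggregate: one distinct count for the whole input.
std::size_t CountDistinct(const std::vector<Column>& columns) {
  if (columns.empty()) return 0;
  std::set<Row> seen;
  for (std::size_t i = 0; i < columns[0].size(); ++i) {
    seen.insert(RowAt(columns, i));
  }
  return seen.size();
}

// Group aggregate: one distinct count per group key.
std::map<std::string, std::size_t> CountDistinctByGroup(
    const std::vector<std::string>& keys,
    const std::vector<Column>& columns) {
  std::map<std::string, std::set<Row>> seen;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    seen[keys[i]].insert(RowAt(columns, i));
  }
  std::map<std::string, std::size_t> counts;
  for (const auto& entry : seen) counts[entry.first] = entry.second.size();
  return counts;
}
```

Both functions take the argument columns as a list of any length,
which mirrors the point that the aggregation side needs to accept
varargs just as the scalar kernels do.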

SELECT DISTINCT is equivalent to our Unique. So one implementation of

SELECT DISTINCT expr1, expr2, ...
FROM TABLE;

could internally group the exprs into a StructArray and call Unique on
that struct array. We could also simply call the aggregation machinery
with no aggregate exprs.
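
The first approach can be sketched as follows. This is not Arrow code:
StructRow, ZipToStructs, Unique and SelectDistinct are hypothetical
names, a std::tuple stands in for a struct value, and the choice to
keep rows in first-occurrence order is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for one element of a two-field struct array.
using StructRow = std::tuple<std::int64_t, std::string>;

// Groups two columns into a single column of struct-like rows.
// If the lengths differ, the extra trailing values are ignored.
std::vector<StructRow> ZipToStructs(const std::vector<std::int64_t>& a,
                                    const std::vector<std::string>& b) {
  const std::size_t n = a.size() < b.size() ? a.size() : b.size();
  std::vector<StructRow> rows;
  rows.reserve(n);
  for (std::size_t i = 0; i < n; ++i) rows.emplace_back(a[i], b[i]);
  return rows;
}

// Returns the distinct values, keeping the first occurrence of each.
template <typename T>
std::vector<T> Unique(const std::vector<T>& values) {
  std::set<T> seen;
  std::vector<T> out;
  for (const auto& value : values) {
    // insert().second is true only when the value was not seen before.
    if (seen.insert(value).second) out.push_back(value);
  }
  return out;
}

// SELECT DISTINCT a, b expressed as Unique over the zipped rows.
std::vector<StructRow> SelectDistinct(const std::vector<std::int64_t>& a,
                                      const std::vector<std::string>& b) {
  return Unique(ZipToStructs(a, b));
}
```

Because Unique here is generic over the element type, the variadic
part is confined to the step that packs the columns into rows.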

We might want to create Jira issues for the above if they do not already exist.

On Fri, Jun 18, 2021 at 4:37 PM Ian Cook <ia...@ursacomputing.com> wrote:
>
> > Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a GROUP BY query? Do they need to be exposed as standalone kernels?
>
> I listed SELECT DISTINCT and COUNT DISTINCT in the document only as
> examples of SQL statements that take a variable number of arguments,
> not to imply that these should be exposed as compute kernels in Arrow.
> But I think you are right to suggest that they do not really belong in
> this list, because as you say it is probably best to think of them as
> shortcut SQL syntax for obtaining results that could instead be
> obtained through a GROUP BY query. I have removed them.
>
> Thank you,
> Ian

Re: [C++] Apache Arrow C++ Variadic Kernels Design

Posted by Ian Cook <ia...@ursacomputing.com>.

> Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a GROUP BY query? Do they need to be exposed as standalone kernels?

I listed SELECT DISTINCT and COUNT DISTINCT in the document only as
examples of SQL statements that take a variable number of arguments,
not to imply that these should be exposed as compute kernels in Arrow.
But I think you are right to suggest that they do not really belong in
this list, because as you say it is probably best to think of them as
shortcut SQL syntax for obtaining results that could instead be
obtained through a GROUP BY query. I have removed them.

Thank you,
Ian

Re: [C++] Apache Arrow C++ Variadic Kernels Design

Posted by Antoine Pitrou <an...@python.org>.

Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a 
GROUP BY query? Do they need to be exposed as standalone kernels?

