You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/05/11 22:17:37 UTC

Re: [C++] Revamping approach to Arrow compute kernel development

I'm working actively on this but perhaps as expected it has ballooned
into a very large project -- it's unclear at the moment whether I'll
be able to break the work into smaller patches that are easier to
digest. I'm working as fast as I can to have an initial
feature-preserving PR up, but the changes to the src/arrow/compute
directory are extensive, so I would please ask that we do not merge
non-essential patches into cpp/src/arrow/compute until I get the PR up
(estimated later this week at present rate) so we can see where things
stand.

On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com> wrote:
>
> On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <em...@gmail.com> wrote:
> >
> > Hi Wes,
> > I haven't had time to read the doc, but wanted to ask some questions on
> > points raised on the thread.
> >
> > * For efficiency, kernels used for array-expr evaluation should write
> > > into preallocated memory as their default mode. This enables the
> > > interpreter to avoid temporary memory allocations and improve CPU
> > > cache utilization. Almost none of our kernels are implemented this way
> > > currently.
> >
> > Did something change, I was pretty sure I submitted a patch a while ago for
> > boolean kernels, that separated out memory allocation from computation.
> > Which should allow for writing to the same memory.  Is this a concern with
> > the public Function APIs for the Kernel APIs themselves, or a lower level
> > implementation concern?
>
> Yes, you did in the internal implementation [1]. The concern is the
> public API and the general approach to implementing new kernels.
>
> I'm working on this right now (it's a large project so it will take me
> a little while to produce something to be reviewed) so bear with me =)
>
> [1]: https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
>
> > * Sorting is generally handled by different data processing nodes from
> > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > Projections and Filters use expressions, they do not sort.
> >
> > Would sorting the list-column elements per row be an array-expr?
>
> Yes, as that's an element-wise function. When I said sorting I was
> referring to ORDER BY. The functions we have that do sorting do so in
> the context of a single array [2].
>
> A query engine must be able to sort a (potentially very large) stream
> of record batches. One approach is for the Sort operator to exhaust
> its child input, accumulating all of the record batches in memory
> (spilling to disk as needed) and then sorting and emitting record
> batches from the sorted records/tuples. See e.g. Impala's sorting code
> [3] [4]
>
> [2]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
> [3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
> [4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
>
> >
> > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <an...@python.org> wrote:
> > > >
> > > >
> > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > > >>
> > > > >> That said, in the SortToIndices case, this wouldn't be a problem,
> > > since
> > > > >> only the second pass writes to the output.
> > > > >
> > > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
> > > > > linked), such as what you can write in SQL
> > > > >
> > > > > Kernels like SortToIndices are a different type of function (in other
> > > > > words, "not a SQL function") and so if we choose to allow such a
> > > > > "non-SQL-like" functions in the expression evaluator then different
> > > > > logic must be used.
> > > >
> > > > Hmm, I think that maybe I'm misunderstanding at which level we're
> > > > talking here.  SortToIndices() may not be a "SQL function", but it looks
> > > > like an important basic block for a query engine (since, after all,
> > > > sorting results is an often used feature in SQL and other languages).
> > > > So it should be usable *inside* the expression engine, even though it's
> > > > not part of the exposed vocabulary, no?
> > >
> > > No, not as part of "expressions" as they are defined in the context of
> > > SQL engines.
> > >
> > > Sorting is generally handled by different data processing nodes from
> > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > Projections and Filters use expressions, they do not sort.
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >

Re: [C++] Revamping approach to Arrow compute kernel development

Posted by Wes McKinney <we...@gmail.com>.
I have spent some time going through the JIRA backlog and have
organized an umbrella JIRA with about 75 issues under it to help
organize building out further compute kernels and kernel execution
functionality:

https://issues.apache.org/jira/browse/ARROW-8894

On Sun, May 24, 2020 at 9:36 AM Wes McKinney <we...@gmail.com> wrote:
>
> I have merged the patch but left the PR open for additional code review.
>
> On Sat, May 23, 2020 at 3:24 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > To be clear given the scope of code affected I think we should merge it today and address further feedback in a follow up patch. I will be diligent about responding to additional comments in the PR
> >
> > On Sat, May 23, 2020, 3:19 PM Wes McKinney <we...@gmail.com> wrote:
> >>
> >> Yes you should still be able to comment. I will reopen the PR after it is merged
> >>
> >> On Sat, May 23, 2020, 2:52 PM Micah Kornfield <em...@gmail.com> wrote:
> >>>
> >>> Hi Wes,
> >>> Will we still be able to comment on the PR once it is closed?
> >>>
> >>>
> >>> If we want to be inclusive on feedback it might pay to wait until Tuesday evening US time to merge since it is a long weekend here.
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> On Saturday, May 23, 2020, Wes McKinney <we...@gmail.com> wrote:
> >>>>
> >>>> Hi folks -- I've addressed a good deal of feedback and added a lot of
> >>>> comments and with Kou's help have got the build passing, It would be
> >>>> great if this could be merged soon to unblock follow up PRs
> >>>>
> >>>> On Wed, May 20, 2020 at 11:55 PM Wes McKinney <we...@gmail.com> wrote:
> >>>> >
> >>>> > I just opened the PR https://github.com/apache/arrow/pull/7240
> >>>> >
> >>>> > I'm sorry it's so big. I really think this is the best way. The only
> >>>> > further work I plan to do on it is to get the CI passing.
> >>>> >
> >>>> > On Wed, May 20, 2020 at 12:26 PM Wes McKinney <we...@gmail.com> wrote:
> >>>> > >
> >>>> > > I'd guess I'm < 24 hours away from putting up my initial PR for this
> >>>> > > work. Since the work is large and (for all practical purposes) nearly
> >>>> > > impossible to separate into separately merge-ready PRs, I'll start a
> >>>> > > new e-mail thread describing what I've done in more detail and
> >>>> > > proposing a path for merging the PR (so as to unblock PRs into
> >>>> > > arrow/compute and avoid rebasing headaches) and addressing review
> >>>> > > feedback that will likely take several weeks to fully accumulate.
> >>>> > >
> >>>> > > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <we...@gmail.com> wrote:
> >>>> > > >
> >>>> > > > I'm working actively on this but perhaps as expected it has ballooned
> >>>> > > > into a very large project -- it's unclear at the moment whether I'll
> >>>> > > > be able to break the work into smaller patches that are easier to
> >>>> > > > digest. I'm working as fast as I can to have an initial
> >>>> > > > feature-preserving PR up, but the changes to the src/arrow/compute
> >>>> > > > directory are extensive, so I would please ask that we do not merge
> >>>> > > > non-essential patches into cpp/src/arrow/compute until I get the PR up
> >>>> > > > (estimated later this week at present rate) so we can see where things
> >>>> > > > stand.
> >>>> > > >
> >>>> > > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com> wrote:
> >>>> > > > >
> >>>> > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <em...@gmail.com> wrote:
> >>>> > > > > >
> >>>> > > > > > Hi Wes,
> >>>> > > > > > I haven't had time to read the doc, but wanted to ask some questions on
> >>>> > > > > > points raised on the thread.
> >>>> > > > > >
> >>>> > > > > > * For efficiency, kernels used for array-expr evaluation should write
> >>>> > > > > > > into preallocated memory as their default mode. This enables the
> >>>> > > > > > > interpreter to avoid temporary memory allocations and improve CPU
> >>>> > > > > > > cache utilization. Almost none of our kernels are implemented this way
> >>>> > > > > > > currently.
> >>>> > > > > >
> >>>> > > > > > Did something change, I was pretty sure I submitted a patch a while ago for
> >>>> > > > > > boolean kernels, that separated out memory allocation from computation.
> >>>> > > > > > Which should allow for writing to the same memory.  Is this a concern with
> >>>> > > > > > the public Function APIs for the Kernel APIs themselves, or a lower level
> >>>> > > > > > implementation concern?
> >>>> > > > >
> >>>> > > > > Yes, you did in the internal implementation [1]. The concern is the
> >>>> > > > > public API and the general approach to implementing new kernels.
> >>>> > > > >
> >>>> > > > > I'm working on this right now (it's a large project so it will take me
> >>>> > > > > a little while to produce something to be reviewed) so bear with me =)
> >>>> > > > >
> >>>> > > > > [1]: https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
> >>>> > > > >
> >>>> > > > > > * Sorting is generally handled by different data processing nodes from
> >>>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> >>>> > > > > > > Projections and Filters use expressions, they do not sort.
> >>>> > > > > >
> >>>> > > > > > Would sorting the list-column elements per row be an array-expr?
> >>>> > > > >
> >>>> > > > > Yes, as that's an element-wise function. When I said sorting I was
> >>>> > > > > referring to ORDER BY. The functions we have that do sorting do so in
> >>>> > > > > the context of a single array [2].
> >>>> > > > >
> >>>> > > > > A query engine must be able to sort a (potentially very large) stream
> >>>> > > > > of record batches. One approach is for the Sort operator to exhaust
> >>>> > > > > its child input, accumulating all of the record batches in memory
> >>>> > > > > (spilling to disk as needed) and then sorting and emitting record
> >>>> > > > > batches from the sorted records/tuples. See e.g. Impala's sorting code
> >>>> > > > > [3] [4]
> >>>> > > > >
> >>>> > > > > [2]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
> >>>> > > > > [3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
> >>>> > > > > [4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
> >>>> > > > >
> >>>> > > > > >
> >>>> > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <we...@gmail.com> wrote:
> >>>> > > > > >
> >>>> > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <an...@python.org> wrote:
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> >>>> > > > > > > > >>
> >>>> > > > > > > > >> That said, in the SortToIndices case, this wouldn't be a problem,
> >>>> > > > > > > since
> >>>> > > > > > > > >> only the second pass writes to the output.
> >>>> > > > > > > > >
> >>>> > > > > > > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
> >>>> > > > > > > > > linked), such as what you can write in SQL
> >>>> > > > > > > > >
> >>>> > > > > > > > > Kernels like SortToIndices are a different type of function (in other
> >>>> > > > > > > > > words, "not a SQL function") and so if we choose to allow such a
> >>>> > > > > > > > > "non-SQL-like" functions in the expression evaluator then different
> >>>> > > > > > > > > logic must be used.
> >>>> > > > > > > >
> >>>> > > > > > > > Hmm, I think that maybe I'm misunderstanding at which level we're
> >>>> > > > > > > > talking here.  SortToIndices() may not be a "SQL function", but it looks
> >>>> > > > > > > > like an important basic block for a query engine (since, after all,
> >>>> > > > > > > > sorting results is an often used feature in SQL and other languages).
> >>>> > > > > > > > So it should be usable *inside* the expression engine, even though it's
> >>>> > > > > > > > not part of the exposed vocabulary, no?
> >>>> > > > > > >
> >>>> > > > > > > No, not as part of "expressions" as they are defined in the context of
> >>>> > > > > > > SQL engines.
> >>>> > > > > > >
> >>>> > > > > > > Sorting is generally handled by different data processing nodes from
> >>>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> >>>> > > > > > > Projections and Filters use expressions, they do not sort.
> >>>> > > > > > >
> >>>> > > > > > > > Regards
> >>>> > > > > > > >
> >>>> > > > > > > > Antoine.
> >>>> > > > > > >

Re: [C++] Revamping approach to Arrow compute kernel development

Posted by Wes McKinney <we...@gmail.com>.
I have merged the patch but left the PR open for additional code review.

On Sat, May 23, 2020 at 3:24 PM Wes McKinney <we...@gmail.com> wrote:
>
> To be clear given the scope of code affected I think we should merge it today and address further feedback in a follow up patch. I will be diligent about responding to additional comments in the PR
>
> On Sat, May 23, 2020, 3:19 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> Yes you should still be able to comment. I will reopen the PR after it is merged
>>
>> On Sat, May 23, 2020, 2:52 PM Micah Kornfield <em...@gmail.com> wrote:
>>>
>>> Hi Wes,
>>> Will we still be able to comment on the PR once it is closed?
>>>
>>>
>>> If we want to be inclusive on feedback it might pay to wait until Tuesday evening US time to merge since it is a long weekend here.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Saturday, May 23, 2020, Wes McKinney <we...@gmail.com> wrote:
>>>>
>>>> Hi folks -- I've addressed a good deal of feedback and added a lot of
>>>> comments and with Kou's help have got the build passing, It would be
>>>> great if this could be merged soon to unblock follow up PRs
>>>>
>>>> On Wed, May 20, 2020 at 11:55 PM Wes McKinney <we...@gmail.com> wrote:
>>>> >
>>>> > I just opened the PR https://github.com/apache/arrow/pull/7240
>>>> >
>>>> > I'm sorry it's so big. I really think this is the best way. The only
>>>> > further work I plan to do on it is to get the CI passing.
>>>> >
>>>> > On Wed, May 20, 2020 at 12:26 PM Wes McKinney <we...@gmail.com> wrote:
>>>> > >
>>>> > > I'd guess I'm < 24 hours away from putting up my initial PR for this
>>>> > > work. Since the work is large and (for all practical purposes) nearly
>>>> > > impossible to separate into separately merge-ready PRs, I'll start a
>>>> > > new e-mail thread describing what I've done in more detail and
>>>> > > proposing a path for merging the PR (so as to unblock PRs into
>>>> > > arrow/compute and avoid rebasing headaches) and addressing review
>>>> > > feedback that will likely take several weeks to fully accumulate.
>>>> > >
>>>> > > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <we...@gmail.com> wrote:
>>>> > > >
>>>> > > > I'm working actively on this but perhaps as expected it has ballooned
>>>> > > > into a very large project -- it's unclear at the moment whether I'll
>>>> > > > be able to break the work into smaller patches that are easier to
>>>> > > > digest. I'm working as fast as I can to have an initial
>>>> > > > feature-preserving PR up, but the changes to the src/arrow/compute
>>>> > > > directory are extensive, so I would please ask that we do not merge
>>>> > > > non-essential patches into cpp/src/arrow/compute until I get the PR up
>>>> > > > (estimated later this week at present rate) so we can see where things
>>>> > > > stand.
>>>> > > >
>>>> > > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com> wrote:
>>>> > > > >
>>>> > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <em...@gmail.com> wrote:
>>>> > > > > >
>>>> > > > > > Hi Wes,
>>>> > > > > > I haven't had time to read the doc, but wanted to ask some questions on
>>>> > > > > > points raised on the thread.
>>>> > > > > >
>>>> > > > > > * For efficiency, kernels used for array-expr evaluation should write
>>>> > > > > > > into preallocated memory as their default mode. This enables the
>>>> > > > > > > interpreter to avoid temporary memory allocations and improve CPU
>>>> > > > > > > cache utilization. Almost none of our kernels are implemented this way
>>>> > > > > > > currently.
>>>> > > > > >
>>>> > > > > > Did something change, I was pretty sure I submitted a patch a while ago for
>>>> > > > > > boolean kernels, that separated out memory allocation from computation.
>>>> > > > > > Which should allow for writing to the same memory.  Is this a concern with
>>>> > > > > > the public Function APIs for the Kernel APIs themselves, or a lower level
>>>> > > > > > implementation concern?
>>>> > > > >
>>>> > > > > Yes, you did in the internal implementation [1]. The concern is the
>>>> > > > > public API and the general approach to implementing new kernels.
>>>> > > > >
>>>> > > > > I'm working on this right now (it's a large project so it will take me
>>>> > > > > a little while to produce something to be reviewed) so bear with me =)
>>>> > > > >
>>>> > > > > [1]: https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
>>>> > > > >
>>>> > > > > > * Sorting is generally handled by different data processing nodes from
>>>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
>>>> > > > > > > Projections and Filters use expressions, they do not sort.
>>>> > > > > >
>>>> > > > > > Would sorting the list-column elements per row be an array-expr?
>>>> > > > >
>>>> > > > > Yes, as that's an element-wise function. When I said sorting I was
>>>> > > > > referring to ORDER BY. The functions we have that do sorting do so in
>>>> > > > > the context of a single array [2].
>>>> > > > >
>>>> > > > > A query engine must be able to sort a (potentially very large) stream
>>>> > > > > of record batches. One approach is for the Sort operator to exhaust
>>>> > > > > its child input, accumulating all of the record batches in memory
>>>> > > > > (spilling to disk as needed) and then sorting and emitting record
>>>> > > > > batches from the sorted records/tuples. See e.g. Impala's sorting code
>>>> > > > > [3] [4]
>>>> > > > >
>>>> > > > > [2]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
>>>> > > > > [3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
>>>> > > > > [4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
>>>> > > > >
>>>> > > > > >
>>>> > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <we...@gmail.com> wrote:
>>>> > > > > >
>>>> > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <an...@python.org> wrote:
>>>> > > > > > > >
>>>> > > > > > > >
>>>> > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
>>>> > > > > > > > >>
>>>> > > > > > > > >> That said, in the SortToIndices case, this wouldn't be a problem,
>>>> > > > > > > since
>>>> > > > > > > > >> only the second pass writes to the output.
>>>> > > > > > > > >
>>>> > > > > > > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
>>>> > > > > > > > > linked), such as what you can write in SQL
>>>> > > > > > > > >
>>>> > > > > > > > > Kernels like SortToIndices are a different type of function (in other
>>>> > > > > > > > > words, "not a SQL function") and so if we choose to allow such a
>>>> > > > > > > > > "non-SQL-like" functions in the expression evaluator then different
>>>> > > > > > > > > logic must be used.
>>>> > > > > > > >
>>>> > > > > > > > Hmm, I think that maybe I'm misunderstanding at which level we're
>>>> > > > > > > > talking here.  SortToIndices() may not be a "SQL function", but it looks
>>>> > > > > > > > like an important basic block for a query engine (since, after all,
>>>> > > > > > > > sorting results is an often used feature in SQL and other languages).
>>>> > > > > > > > So it should be usable *inside* the expression engine, even though it's
>>>> > > > > > > > not part of the exposed vocabulary, no?
>>>> > > > > > >
>>>> > > > > > > No, not as part of "expressions" as they are defined in the context of
>>>> > > > > > > SQL engines.
>>>> > > > > > >
>>>> > > > > > > Sorting is generally handled by different data processing nodes from
>>>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
>>>> > > > > > > Projections and Filters use expressions, they do not sort.
>>>> > > > > > >
>>>> > > > > > > > Regards
>>>> > > > > > > >
>>>> > > > > > > > Antoine.
>>>> > > > > > >

Re: [C++] Revamping approach to Arrow compute kernel development

Posted by Wes McKinney <we...@gmail.com>.
To be clear given the scope of code affected I think we should merge it
today and address further feedback in a follow up patch. I will be diligent
about responding to additional comments in the PR

On Sat, May 23, 2020, 3:19 PM Wes McKinney <we...@gmail.com> wrote:

> Yes you should still be able to comment. I will reopen the PR after it is
> merged
>
> On Sat, May 23, 2020, 2:52 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Wes,
>> Will we still be able to comment on the PR once it is closed?
>>
>>
>> If we want to be inclusive on feedback it might pay to wait until Tuesday
>> evening US time to merge since it is a long weekend here.
>>
>> Thanks,
>> Micah
>>
>> On Saturday, May 23, 2020, Wes McKinney <we...@gmail.com> wrote:
>>
>>> Hi folks -- I've addressed a good deal of feedback and added a lot of
>>> comments and with Kou's help have got the build passing, It would be
>>> great if this could be merged soon to unblock follow up PRs
>>>
>>> On Wed, May 20, 2020 at 11:55 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >
>>> > I just opened the PR https://github.com/apache/arrow/pull/7240
>>> >
>>> > I'm sorry it's so big. I really think this is the best way. The only
>>> > further work I plan to do on it is to get the CI passing.
>>> >
>>> > On Wed, May 20, 2020 at 12:26 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>> > >
>>> > > I'd guess I'm < 24 hours away from putting up my initial PR for this
>>> > > work. Since the work is large and (for all practical purposes) nearly
>>> > > impossible to separate into separately merge-ready PRs, I'll start a
>>> > > new e-mail thread describing what I've done in more detail and
>>> > > proposing a path for merging the PR (so as to unblock PRs into
>>> > > arrow/compute and avoid rebasing headaches) and addressing review
>>> > > feedback that will likely take several weeks to fully accumulate.
>>> > >
>>> > > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>> > > >
>>> > > > I'm working actively on this but perhaps as expected it has
>>> ballooned
>>> > > > into a very large project -- it's unclear at the moment whether
>>> I'll
>>> > > > be able to break the work into smaller patches that are easier to
>>> > > > digest. I'm working as fast as I can to have an initial
>>> > > > feature-preserving PR up, but the changes to the src/arrow/compute
>>> > > > directory are extensive, so I would please ask that we do not merge
>>> > > > non-essential patches into cpp/src/arrow/compute until I get the
>>> PR up
>>> > > > (estimated later this week at present rate) so we can see where
>>> things
>>> > > > stand.
>>> > > >
>>> > > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com>
>>> wrote:
>>> > > > >
>>> > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <
>>> emkornfield@gmail.com> wrote:
>>> > > > > >
>>> > > > > > Hi Wes,
>>> > > > > > I haven't had time to read the doc, but wanted to ask some
>>> questions on
>>> > > > > > points raised on the thread.
>>> > > > > >
>>> > > > > > * For efficiency, kernels used for array-expr evaluation
>>> should write
>>> > > > > > > into preallocated memory as their default mode. This enables
>>> the
>>> > > > > > > interpreter to avoid temporary memory allocations and
>>> improve CPU
>>> > > > > > > cache utilization. Almost none of our kernels are
>>> implemented this way
>>> > > > > > > currently.
>>> > > > > >
>>> > > > > > Did something change, I was pretty sure I submitted a patch a
>>> while ago for
>>> > > > > > boolean kernels, that separated out memory allocation from
>>> computation.
>>> > > > > > Which should allow for writing to the same memory.  Is this a
>>> concern with
>>> > > > > > the public Function APIs for the Kernel APIs themselves, or a
>>> lower level
>>> > > > > > implementation concern?
>>> > > > >
>>> > > > > Yes, you did in the internal implementation [1]. The concern is
>>> the
>>> > > > > public API and the general approach to implementing new kernels.
>>> > > > >
>>> > > > > I'm working on this right now (it's a large project so it will
>>> take me
>>> > > > > a little while to produce something to be reviewed) so bear with
>>> me =)
>>> > > > >
>>> > > > > [1]:
>>> https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
>>> > > > >
>>> > > > > > * Sorting is generally handled by different data processing
>>> nodes from
>>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and
>>> Joins.
>>> > > > > > > Projections and Filters use expressions, they do not sort.
>>> > > > > >
>>> > > > > > Would sorting the list-column elements per row be an
>>> array-expr?
>>> > > > >
>>> > > > > Yes, as that's an element-wise function. When I said sorting I
>>> was
>>> > > > > referring to ORDER BY. The functions we have that do sorting do
>>> so in
>>> > > > > the context of a single array [2].
>>> > > > >
>>> > > > > A query engine must be able to sort a (potentially very large)
>>> stream
>>> > > > > of record batches. One approach is for the Sort operator to
>>> exhaust
>>> > > > > its child input, accumulating all of the record batches in memory
>>> > > > > (spilling to disk as needed) and then sorting and emitting record
>>> > > > > batches from the sorted records/tuples. See e.g. Impala's
>>> sorting code
>>> > > > > [3] [4]
>>> > > > >
>>> > > > > [2]:
>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
>>> > > > > [3]:
>>> https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
>>> > > > > [4]:
>>> https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
>>> > > > >
>>> > > > > >
>>> > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <
>>> wesmckinn@gmail.com> wrote:
>>> > > > > >
>>> > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <
>>> antoine@python.org> wrote:
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
>>> > > > > > > > >>
>>> > > > > > > > >> That said, in the SortToIndices case, this wouldn't be
>>> a problem,
>>> > > > > > > since
>>> > > > > > > > >> only the second pass writes to the output.
>>> > > > > > > > >
>>> > > > > > > > > This kernel is not valid for normal array-exprs (see the
>>> spreadsheet I
>>> > > > > > > > > linked), such as what you can write in SQL
>>> > > > > > > > >
>>> > > > > > > > > Kernels like SortToIndices are a different type of
>>> function (in other
>>> > > > > > > > > words, "not a SQL function") and so if we choose to
>>> allow such a
>>> > > > > > > > > "non-SQL-like" functions in the expression evaluator
>>> then different
>>> > > > > > > > > logic must be used.
>>> > > > > > > >
>>> > > > > > > > Hmm, I think that maybe I'm misunderstanding at which
>>> level we're
>>> > > > > > > > talking here.  SortToIndices() may not be a "SQL
>>> function", but it looks
>>> > > > > > > > like an important basic block for a query engine (since,
>>> after all,
>>> > > > > > > > sorting results is an often used feature in SQL and other
>>> languages).
>>> > > > > > > > So it should be usable *inside* the expression engine,
>>> even though it's
>>> > > > > > > > not part of the exposed vocabulary, no?
>>> > > > > > >
>>> > > > > > > No, not as part of "expressions" as they are defined in the
>>> context of
>>> > > > > > > SQL engines.
>>> > > > > > >
>>> > > > > > > Sorting is generally handled by different data processing
>>> nodes from
>>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and
>>> Joins.
>>> > > > > > > Projections and Filters use expressions, they do not sort.
>>> > > > > > >
>>> > > > > > > > Regards
>>> > > > > > > >
>>> > > > > > > > Antoine.
>>> > > > > > >
>>>
>>

Re: [C++] Revamping approach to Arrow compute kernel development

Posted by Wes McKinney <we...@gmail.com>.
Yes you should still be able to comment. I will reopen the PR after it is
merged

On Sat, May 23, 2020, 2:52 PM Micah Kornfield <em...@gmail.com> wrote:

> Hi Wes,
> Will we still be able to comment on the PR once it is closed?
>
>
> If we want to be inclusive on feedback it might pay to wait until Tuesday
> evening US time to merge since it is a long weekend here.
>
> Thanks,
> Micah
>
> On Saturday, May 23, 2020, Wes McKinney <we...@gmail.com> wrote:
>
>> Hi folks -- I've addressed a good deal of feedback and added a lot of
>> comments and with Kou's help have got the build passing, It would be
>> great if this could be merged soon to unblock follow up PRs
>>
>> On Wed, May 20, 2020 at 11:55 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> > I just opened the PR https://github.com/apache/arrow/pull/7240
>> >
>> > I'm sorry it's so big. I really think this is the best way. The only
>> > further work I plan to do on it is to get the CI passing.
>> >
>> > On Wed, May 20, 2020 at 12:26 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> > >
>> > > I'd guess I'm < 24 hours away from putting up my initial PR for this
>> > > work. Since the work is large and (for all practical purposes) nearly
>> > > impossible to separate into separately merge-ready PRs, I'll start a
>> > > new e-mail thread describing what I've done in more detail and
>> > > proposing a path for merging the PR (so as to unblock PRs into
>> > > arrow/compute and avoid rebasing headaches) and addressing review
>> > > feedback that will likely take several weeks to fully accumulate.
>> > >
>> > > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> > > >
>> > > > I'm working actively on this but perhaps as expected it has
>> ballooned
>> > > > into a very large project -- it's unclear at the moment whether I'll
>> > > > be able to break the work into smaller patches that are easier to
>> > > > digest. I'm working as fast as I can to have an initial
>> > > > feature-preserving PR up, but the changes to the src/arrow/compute
>> > > > directory are extensive, so I would please ask that we do not merge
>> > > > non-essential patches into cpp/src/arrow/compute until I get the PR
>> up
>> > > > (estimated later this week at present rate) so we can see where
>> things
>> > > > stand.
>> > > >
>> > > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com>
>> wrote:
>> > > > >
>> > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <
>> emkornfield@gmail.com> wrote:
>> > > > > >
>> > > > > > Hi Wes,
>> > > > > > I haven't had time to read the doc, but wanted to ask some
>> questions on
>> > > > > > points raised on the thread.
>> > > > > >
>> > > > > > * For efficiency, kernels used for array-expr evaluation should
>> write
>> > > > > > > into preallocated memory as their default mode. This enables
>> the
>> > > > > > > interpreter to avoid temporary memory allocations and improve
>> CPU
>> > > > > > > cache utilization. Almost none of our kernels are implemented
>> this way
>> > > > > > > currently.
>> > > > > >
>> > > > > > Did something change, I was pretty sure I submitted a patch a
>> while ago for
>> > > > > > boolean kernels, that separated out memory allocation from
>> computation.
>> > > > > > Which should allow for writing to the same memory.  Is this a
>> concern with
>> > > > > > the public Function APIs for the Kernel APIs themselves, or a
>> lower level
>> > > > > > implementation concern?
>> > > > >
>> > > > > Yes, you did in the internal implementation [1]. The concern is
>> the
>> > > > > public API and the general approach to implementing new kernels.
>> > > > >
>> > > > > I'm working on this right now (it's a large project so it will
>> take me
>> > > > > a little while to produce something to be reviewed) so bear with
>> me =)
>> > > > >
>> > > > > [1]:
>> https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
>> > > > >
>> > > > > > * Sorting is generally handled by different data processing
>> nodes from
>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and
>> Joins.
>> > > > > > > Projections and Filters use expressions, they do not sort.
>> > > > > >
>> > > > > > Would sorting the list-column elements per row be an array-expr?
>> > > > >
>> > > > > Yes, as that's an element-wise function. When I said sorting I was
>> > > > > referring to ORDER BY. The functions we have that do sorting do
>> so in
>> > > > > the context of a single array [2].
>> > > > >
>> > > > > A query engine must be able to sort a (potentially very large)
>> stream
>> > > > > of record batches. One approach is for the Sort operator to
>> exhaust
>> > > > > its child input, accumulating all of the record batches in memory
>> > > > > (spilling to disk as needed) and then sorting and emitting record
>> > > > > batches from the sorted records/tuples. See e.g. Impala's sorting
>> code
>> > > > > [3] [4]
>> > > > >
>> > > > > [2]:
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
>> > > > > [3]:
>> https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
>> > > > > [4]:
>> https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
>> > > > >
>> > > > > >
>> > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <
>> wesmckinn@gmail.com> wrote:
>> > > > > >
>> > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <
>> antoine@python.org> wrote:
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
>> > > > > > > > >>
>> > > > > > > > >> That said, in the SortToIndices case, this wouldn't be a
>> problem,
>> > > > > > > since
>> > > > > > > > >> only the second pass writes to the output.
>> > > > > > > > >
>> > > > > > > > > This kernel is not valid for normal array-exprs (see the
>> spreadsheet I
>> > > > > > > > > linked), such as what you can write in SQL
>> > > > > > > > >
>> > > > > > > > > Kernels like SortToIndices are a different type of
>> function (in other
>> > > > > > > > > words, "not a SQL function") and so if we choose to allow
>> such a
>> > > > > > > > > "non-SQL-like" functions in the expression evaluator then
>> different
>> > > > > > > > > logic must be used.
>> > > > > > > >
>> > > > > > > > Hmm, I think that maybe I'm misunderstanding at which level
>> we're
>> > > > > > > > talking here.  SortToIndices() may not be a "SQL function",
>> but it looks
>> > > > > > > > like an important basic block for a query engine (since,
>> after all,
>> > > > > > > > sorting results is an often used feature in SQL and other
>> languages).
>> > > > > > > > So it should be usable *inside* the expression engine, even
>> though it's
>> > > > > > > > not part of the exposed vocabulary, no?
>> > > > > > >
>> > > > > > > No, not as part of "expressions" as they are defined in the
>> context of
>> > > > > > > SQL engines.
>> > > > > > >
>> > > > > > > Sorting is generally handled by different data processing
>> nodes from
>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and
>> Joins.
>> > > > > > > Projections and Filters use expressions, they do not sort.
>> > > > > > >
>> > > > > > > > Regards
>> > > > > > > >
>> > > > > > > > Antoine.
>> > > > > > >
>>
>

Re: [C++] Revamping approach to Arrow compute kernel development

Posted by Micah Kornfield <em...@gmail.com>.
Hi Wes,
Will we still be able to comment on the PR once it is closed?


If we want to be inclusive on feedback it might pay to wait until Tuesday
evening US time to merge since it is a long weekend here.

Thanks,
Micah

On Saturday, May 23, 2020, Wes McKinney <we...@gmail.com> wrote:

> Hi folks -- I've addressed a good deal of feedback and added a lot of
> comments and with Kou's help have got the build passing, It would be
> great if this could be merged soon to unblock follow up PRs
>
> On Wed, May 20, 2020 at 11:55 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > I just opened the PR https://github.com/apache/arrow/pull/7240
> >
> > I'm sorry it's so big. I really think this is the best way. The only
> > further work I plan to do on it is to get the CI passing.
> >
> > On Wed, May 20, 2020 at 12:26 PM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > I'd guess I'm < 24 hours away from putting up my initial PR for this
> > > work. Since the work is large and (for all practical purposes) nearly
> > > impossible to separate into separately merge-ready PRs, I'll start a
> > > new e-mail thread describing what I've done in more detail and
> > > proposing a path for merging the PR (so as to unblock PRs into
> > > arrow/compute and avoid rebasing headaches) and addressing review
> > > feedback that will likely take several weeks to fully accumulate.
> > >
> > > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <we...@gmail.com>
> wrote:
> > > >
> > > > I'm working actively on this but perhaps as expected it has ballooned
> > > > into a very large project -- it's unclear at the moment whether I'll
> > > > be able to break the work into smaller patches that are easier to
> > > > digest. I'm working as fast as I can to have an initial
> > > > feature-preserving PR up, but the changes to the src/arrow/compute
> > > > directory are extensive, so I would please ask that we do not merge
> > > > non-essential patches into cpp/src/arrow/compute until I get the PR
> up
> > > > (estimated later this week at present rate) so we can see where
> things
> > > > stand.
> > > >
> > > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com>
> wrote:
> > > > >
> > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> > > > > >
> > > > > > Hi Wes,
> > > > > > I haven't had time to read the doc, but wanted to ask some
> questions on
> > > > > > points raised on the thread.
> > > > > >
> > > > > > * For efficiency, kernels used for array-expr evaluation should
> write
> > > > > > > into preallocated memory as their default mode. This enables
> the
> > > > > > > interpreter to avoid temporary memory allocations and improve
> CPU
> > > > > > > cache utilization. Almost none of our kernels are implemented
> this way
> > > > > > > currently.
> > > > > >
> > > > > > Did something change, I was pretty sure I submitted a patch a
> while ago for
> > > > > > boolean kernels, that separated out memory allocation from
> computation.
> > > > > > Which should allow for writing to the same memory.  Is this a
> concern with
> > > > > > the public Function APIs for the Kernel APIs themselves, or a
> lower level
> > > > > > implementation concern?
> > > > >
> > > > > Yes, you did in the internal implementation [1]. The concern is the
> > > > > public API and the general approach to implementing new kernels.
> > > > >
> > > > > I'm working on this right now (it's a large project so it will
> take me
> > > > > a little while to produce something to be reviewed) so bear with
> me =)
> > > > >
> > > > > [1]: https://github.com/apache/arrow/commit/
> 4910fbf4fda05b864daaba820db08291e4afdcb6#diff-
> 561ea05d36150eb15842f452e3f07c76
> > > > >
> > > > > > * Sorting is generally handled by different data processing
> nodes from
> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and
> Joins.
> > > > > > > Projections and Filters use expressions, they do not sort.
> > > > > >
> > > > > > Would sorting the list-column elements per row be an array-expr?
> > > > >
> > > > > Yes, as that's an element-wise function. When I said sorting I was
> > > > > referring to ORDER BY. The functions we have that do sorting do so
> in
> > > > > the context of a single array [2].
> > > > >
> > > > > A query engine must be able to sort a (potentially very large)
> stream
> > > > > of record batches. One approach is for the Sort operator to exhaust
> > > > > its child input, accumulating all of the record batches in memory
> > > > > (spilling to disk as needed) and then sorting and emitting record
> > > > > batches from the sorted records/tuples. See e.g. Impala's sorting
> code
> > > > > [3] [4]
> > > > >
> > > > > [2]: https://github.com/apache/arrow/blob/master/cpp/src/
> arrow/compute/kernels/sort_to_indices.h#L34
> > > > > [3]: https://github.com/apache/impala/blob/master/be/src/
> runtime/sorter.h
> > > > > [4]: https://github.com/apache/impala/blob/master/be/src/
> exec/sort-node.h
> > > > >
> > > > > >
> > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <
> wesmckinn@gmail.com> wrote:
> > > > > >
> > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <
> antoine@python.org> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > > > > > > >>
> > > > > > > > >> That said, in the SortToIndices case, this wouldn't be a
> problem,
> > > > > > > since
> > > > > > > > >> only the second pass writes to the output.
> > > > > > > > >
> > > > > > > > > This kernel is not valid for normal array-exprs (see the
> spreadsheet I
> > > > > > > > > linked), such as what you can write in SQL
> > > > > > > > >
> > > > > > > > > Kernels like SortToIndices are a different type of
> function (in other
> > > > > > > > > words, "not a SQL function") and so if we choose to allow
> such a
> > > > > > > > > "non-SQL-like" functions in the expression evaluator then
> different
> > > > > > > > > logic must be used.
> > > > > > > >
> > > > > > > > Hmm, I think that maybe I'm misunderstanding at which level
> we're
> > > > > > > > talking here.  SortToIndices() may not be a "SQL function",
> but it looks
> > > > > > > > like an important basic block for a query engine (since,
> after all,
> > > > > > > > sorting results is an often used feature in SQL and other
> languages).
> > > > > > > > So it should be usable *inside* the expression engine, even
> though it's
> > > > > > > > not part of the exposed vocabulary, no?
> > > > > > >
> > > > > > > No, not as part of "expressions" as they are defined in the
> context of
> > > > > > > SQL engines.
> > > > > > >
> > > > > > > Sorting is generally handled by different data processing
> nodes from
> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and
> Joins.
> > > > > > > Projections and Filters use expressions, they do not sort.
> > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > >
>

Re: [C++] Revamping approach to Arrow compute kernel development

Posted by Wes McKinney <we...@gmail.com>.
Hi folks -- I've addressed a good deal of feedback and added a lot of
comments and with Kou's help have got the build passing, It would be
great if this could be merged soon to unblock follow up PRs

On Wed, May 20, 2020 at 11:55 PM Wes McKinney <we...@gmail.com> wrote:
>
> I just opened the PR https://github.com/apache/arrow/pull/7240
>
> I'm sorry it's so big. I really think this is the best way. The only
> further work I plan to do on it is to get the CI passing.
>
> On Wed, May 20, 2020 at 12:26 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > I'd guess I'm < 24 hours away from putting up my initial PR for this
> > work. Since the work is large and (for all practical purposes) nearly
> > impossible to separate into separately merge-ready PRs, I'll start a
> > new e-mail thread describing what I've done in more detail and
> > proposing a path for merging the PR (so as to unblock PRs into
> > arrow/compute and avoid rebasing headaches) and addressing review
> > feedback that will likely take several weeks to fully accumulate.
> >
> > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > I'm working actively on this but perhaps as expected it has ballooned
> > > into a very large project -- it's unclear at the moment whether I'll
> > > be able to break the work into smaller patches that are easier to
> > > digest. I'm working as fast as I can to have an initial
> > > feature-preserving PR up, but the changes to the src/arrow/compute
> > > directory are extensive, so I would please ask that we do not merge
> > > non-essential patches into cpp/src/arrow/compute until I get the PR up
> > > (estimated later this week at present rate) so we can see where things
> > > stand.
> > >
> > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <em...@gmail.com> wrote:
> > > > >
> > > > > Hi Wes,
> > > > > I haven't had time to read the doc, but wanted to ask some questions on
> > > > > points raised on the thread.
> > > > >
> > > > > * For efficiency, kernels used for array-expr evaluation should write
> > > > > > into preallocated memory as their default mode. This enables the
> > > > > > interpreter to avoid temporary memory allocations and improve CPU
> > > > > > cache utilization. Almost none of our kernels are implemented this way
> > > > > > currently.
> > > > >
> > > > > Did something change, I was pretty sure I submitted a patch a while ago for
> > > > > boolean kernels, that separated out memory allocation from computation.
> > > > > Which should allow for writing to the same memory.  Is this a concern with
> > > > > the public Function APIs for the Kernel APIs themselves, or a lower level
> > > > > implementation concern?
> > > >
> > > > Yes, you did in the internal implementation [1]. The concern is the
> > > > public API and the general approach to implementing new kernels.
> > > >
> > > > I'm working on this right now (it's a large project so it will take me
> > > > a little while to produce something to be reviewed) so bear with me =)
> > > >
> > > > [1]: https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
> > > >
> > > > > * Sorting is generally handled by different data processing nodes from
> > > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > > > > Projections and Filters use expressions, they do not sort.
> > > > >
> > > > > Would sorting the list-column elements per row be an array-expr?
> > > >
> > > > Yes, as that's an element-wise function. When I said sorting I was
> > > > referring to ORDER BY. The functions we have that do sorting do so in
> > > > the context of a single array [2].
> > > >
> > > > A query engine must be able to sort a (potentially very large) stream
> > > > of record batches. One approach is for the Sort operator to exhaust
> > > > its child input, accumulating all of the record batches in memory
> > > > (spilling to disk as needed) and then sorting and emitting record
> > > > batches from the sorted records/tuples. See e.g. Impala's sorting code
> > > > [3] [4]
> > > >
> > > > [2]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
> > > > [3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
> > > > [4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
> > > >
> > > > >
> > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <we...@gmail.com> wrote:
> > > > >
> > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <an...@python.org> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > > > > > >>
> > > > > > > >> That said, in the SortToIndices case, this wouldn't be a problem,
> > > > > > since
> > > > > > > >> only the second pass writes to the output.
> > > > > > > >
> > > > > > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
> > > > > > > > linked), such as what you can write in SQL
> > > > > > > >
> > > > > > > > Kernels like SortToIndices are a different type of function (in other
> > > > > > > > words, "not a SQL function") and so if we choose to allow such a
> > > > > > > > "non-SQL-like" functions in the expression evaluator then different
> > > > > > > > logic must be used.
> > > > > > >
> > > > > > > Hmm, I think that maybe I'm misunderstanding at which level we're
> > > > > > > talking here.  SortToIndices() may not be a "SQL function", but it looks
> > > > > > > like an important basic block for a query engine (since, after all,
> > > > > > > sorting results is an often used feature in SQL and other languages).
> > > > > > > So it should be usable *inside* the expression engine, even though it's
> > > > > > > not part of the exposed vocabulary, no?
> > > > > >
> > > > > > No, not as part of "expressions" as they are defined in the context of
> > > > > > SQL engines.
> > > > > >
> > > > > > Sorting is generally handled by different data processing nodes from
> > > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > > > > Projections and Filters use expressions, they do not sort.
> > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > >

Re: [C++] Revamping approach to Arrow compute kernel development

Posted by Wes McKinney <we...@gmail.com>.
I just opened the PR https://github.com/apache/arrow/pull/7240

I'm sorry it's so big. I really think this is the best way. The only
further work I plan to do on it is to get the CI passing.

On Wed, May 20, 2020 at 12:26 PM Wes McKinney <we...@gmail.com> wrote:
>
> I'd guess I'm < 24 hours away from putting up my initial PR for this
> work. Since the work is large and (for all practical purposes) nearly
> impossible to separate into separately merge-ready PRs, I'll start a
> new e-mail thread describing what I've done in more detail and
> proposing a path for merging the PR (so as to unblock PRs into
> arrow/compute and avoid rebasing headaches) and addressing review
> feedback that will likely take several weeks to fully accumulate.
>
> On Mon, May 11, 2020 at 5:17 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > I'm working actively on this but perhaps as expected it has ballooned
> > into a very large project -- it's unclear at the moment whether I'll
> > be able to break the work into smaller patches that are easier to
> > digest. I'm working as fast as I can to have an initial
> > feature-preserving PR up, but the changes to the src/arrow/compute
> > directory are extensive, so I would please ask that we do not merge
> > non-essential patches into cpp/src/arrow/compute until I get the PR up
> > (estimated later this week at present rate) so we can see where things
> > stand.
> >
> > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <em...@gmail.com> wrote:
> > > >
> > > > Hi Wes,
> > > > I haven't had time to read the doc, but wanted to ask some questions on
> > > > points raised on the thread.
> > > >
> > > > * For efficiency, kernels used for array-expr evaluation should write
> > > > > into preallocated memory as their default mode. This enables the
> > > > > interpreter to avoid temporary memory allocations and improve CPU
> > > > > cache utilization. Almost none of our kernels are implemented this way
> > > > > currently.
> > > >
> > > > Did something change, I was pretty sure I submitted a patch a while ago for
> > > > boolean kernels, that separated out memory allocation from computation.
> > > > Which should allow for writing to the same memory.  Is this a concern with
> > > > the public Function APIs for the Kernel APIs themselves, or a lower level
> > > > implementation concern?
> > >
> > > Yes, you did in the internal implementation [1]. The concern is the
> > > public API and the general approach to implementing new kernels.
> > >
> > > I'm working on this right now (it's a large project so it will take me
> > > a little while to produce something to be reviewed) so bear with me =)
> > >
> > > [1]: https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
> > >
> > > > * Sorting is generally handled by different data processing nodes from
> > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > > > Projections and Filters use expressions, they do not sort.
> > > >
> > > > Would sorting the list-column elements per row be an array-expr?
> > >
> > > Yes, as that's an element-wise function. When I said sorting I was
> > > referring to ORDER BY. The functions we have that do sorting do so in
> > > the context of a single array [2].
> > >
> > > A query engine must be able to sort a (potentially very large) stream
> > > of record batches. One approach is for the Sort operator to exhaust
> > > its child input, accumulating all of the record batches in memory
> > > (spilling to disk as needed) and then sorting and emitting record
> > > batches from the sorted records/tuples. See e.g. Impala's sorting code
> > > [3] [4]
> > >
> > > [2]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
> > > [3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
> > > [4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
> > >
> > > >
> > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <an...@python.org> wrote:
> > > > > >
> > > > > >
> > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > > > > >>
> > > > > > >> That said, in the SortToIndices case, this wouldn't be a problem,
> > > > > since
> > > > > > >> only the second pass writes to the output.
> > > > > > >
> > > > > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
> > > > > > > linked), such as what you can write in SQL
> > > > > > >
> > > > > > > Kernels like SortToIndices are a different type of function (in other
> > > > > > > words, "not a SQL function") and so if we choose to allow such a
> > > > > > > "non-SQL-like" functions in the expression evaluator then different
> > > > > > > logic must be used.
> > > > > >
> > > > > > Hmm, I think that maybe I'm misunderstanding at which level we're
> > > > > > talking here.  SortToIndices() may not be a "SQL function", but it looks
> > > > > > like an important basic block for a query engine (since, after all,
> > > > > > sorting results is an often used feature in SQL and other languages).
> > > > > > So it should be usable *inside* the expression engine, even though it's
> > > > > > not part of the exposed vocabulary, no?
> > > > >
> > > > > No, not as part of "expressions" as they are defined in the context of
> > > > > SQL engines.
> > > > >
> > > > > Sorting is generally handled by different data processing nodes from
> > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > > > Projections and Filters use expressions, they do not sort.
> > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > >

Re: [C++] Revamping approach to Arrow compute kernel development

Posted by Wes McKinney <we...@gmail.com>.
I'd guess I'm < 24 hours away from putting up my initial PR for this
work. Since the work is large and (for all practical purposes) nearly
impossible to separate into separately merge-ready PRs, I'll start a
new e-mail thread describing what I've done in more detail and
proposing a path for merging the PR (so as to unblock PRs into
arrow/compute and avoid rebasing headaches) and addressing review
feedback that will likely take several weeks to fully accumulate.

On Mon, May 11, 2020 at 5:17 PM Wes McKinney <we...@gmail.com> wrote:
>
> I'm working actively on this but perhaps as expected it has ballooned
> into a very large project -- it's unclear at the moment whether I'll
> be able to break the work into smaller patches that are easier to
> digest. I'm working as fast as I can to have an initial
> feature-preserving PR up, but the changes to the src/arrow/compute
> directory are extensive, so I would please ask that we do not merge
> non-essential patches into cpp/src/arrow/compute until I get the PR up
> (estimated later this week at present rate) so we can see where things
> stand.
>
> On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <em...@gmail.com> wrote:
> > >
> > > Hi Wes,
> > > I haven't had time to read the doc, but wanted to ask some questions on
> > > points raised on the thread.
> > >
> > > * For efficiency, kernels used for array-expr evaluation should write
> > > > into preallocated memory as their default mode. This enables the
> > > > interpreter to avoid temporary memory allocations and improve CPU
> > > > cache utilization. Almost none of our kernels are implemented this way
> > > > currently.
> > >
> > > Did something change, I was pretty sure I submitted a patch a while ago for
> > > boolean kernels, that separated out memory allocation from computation.
> > > Which should allow for writing to the same memory.  Is this a concern with
> > > the public Function APIs for the Kernel APIs themselves, or a lower level
> > > implementation concern?
> >
> > Yes, you did in the internal implementation [1]. The concern is the
> > public API and the general approach to implementing new kernels.
> >
> > I'm working on this right now (it's a large project so it will take me
> > a little while to produce something to be reviewed) so bear with me =)
> >
> > [1]: https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
> >
> > > * Sorting is generally handled by different data processing nodes from
> > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > > Projections and Filters use expressions, they do not sort.
> > >
> > > Would sorting the list-column elements per row be an array-expr?
> >
> > Yes, as that's an element-wise function. When I said sorting I was
> > referring to ORDER BY. The functions we have that do sorting do so in
> > the context of a single array [2].
> >
> > A query engine must be able to sort a (potentially very large) stream
> > of record batches. One approach is for the Sort operator to exhaust
> > its child input, accumulating all of the record batches in memory
> > (spilling to disk as needed) and then sorting and emitting record
> > batches from the sorted records/tuples. See e.g. Impala's sorting code
> > [3] [4]
> >
> > [2]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
> > [3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
> > [4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
> >
> > >
> > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <an...@python.org> wrote:
> > > > >
> > > > >
> > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > > > >>
> > > > > >> That said, in the SortToIndices case, this wouldn't be a problem,
> > > > since
> > > > > >> only the second pass writes to the output.
> > > > > >
> > > > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
> > > > > > linked), such as what you can write in SQL
> > > > > >
> > > > > > Kernels like SortToIndices are a different type of function (in other
> > > > > > words, "not a SQL function") and so if we choose to allow such a
> > > > > > "non-SQL-like" functions in the expression evaluator then different
> > > > > > logic must be used.
> > > > >
> > > > > Hmm, I think that maybe I'm misunderstanding at which level we're
> > > > > talking here.  SortToIndices() may not be a "SQL function", but it looks
> > > > > like an important basic block for a query engine (since, after all,
> > > > > sorting results is an often used feature in SQL and other languages).
> > > > > So it should be usable *inside* the expression engine, even though it's
> > > > > not part of the exposed vocabulary, no?
> > > >
> > > > No, not as part of "expressions" as they are defined in the context of
> > > > SQL engines.
> > > >
> > > > Sorting is generally handled by different data processing nodes from
> > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > > Projections and Filters use expressions, they do not sort.
> > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > >