Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/02/10 01:06:00 UTC
[jira] [Resolved] (ARROW-4124) [C++] Abstract aggregation kernel API
[ https://issues.apache.org/jira/browse/ARROW-4124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney resolved ARROW-4124.
---------------------------------
Resolution: Fixed
Issue resolved by pull request 3407
[https://github.com/apache/arrow/pull/3407]
> [C++] Abstract aggregation kernel API
> -------------------------------------
>
> Key: ARROW-4124
> URL: https://issues.apache.org/jira/browse/ARROW-4124
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Francois Saint-Jacques
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.13.0
>
> Time Spent: 10.5h
> Remaining Estimate: 0h
>
> In connection with the details of implementing the various aggregation types, we should first put some effort into the abstract API for aggregating data in a multi-threaded setting.
> Aggregators must support both hash/group (e.g. "group by" in SQL or data frame libraries) modes and non-group modes.
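One way to picture the two requirements above is a state object with consume, merge, and finalize steps. The following is only a minimal sketch under that assumption: the names `SumState`, `GroupedSumState`, and `SumInChunks` are hypothetical and are not taken from the Arrow codebase or from pull request 3407. Chunks are processed sequentially here to keep the example dependency-free; because each partial state is independent until it is merged, each chunk could instead be consumed on its own thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical names throughout; this is not the Arrow API.

// Non-group mode: a single running state.
struct SumState {
  int64_t sum = 0;

  // Fold one chunk of values into this state.
  void Consume(const int64_t* values, size_t length) {
    for (size_t i = 0; i < length; ++i) sum += values[i];
  }
  // Combine a partial state produced elsewhere (e.g. by another thread).
  void Merge(const SumState& other) { sum += other.sum; }
  // Produce the final scalar result.
  int64_t Finalize() const { return sum; }
};

// Group mode: one SumState per distinct key.
struct GroupedSumState {
  std::unordered_map<int32_t, SumState> groups;

  void Consume(const int32_t* keys, const int64_t* values, size_t length) {
    for (size_t i = 0; i < length; ++i) groups[keys[i]].Consume(&values[i], 1);
  }
  void Merge(const GroupedSumState& other) {
    for (const auto& kv : other.groups) groups[kv.first].Merge(kv.second);
  }
  std::unordered_map<int32_t, int64_t> Finalize() const {
    std::unordered_map<int32_t, int64_t> out;
    for (const auto& kv : groups) out[kv.first] = kv.second.Finalize();
    return out;
  }
};

// Aggregate chunk by chunk, then merge the partial states. Each Consume call
// touches only its own partial state, so the chunks are independent of each
// other until the Merge step.
inline int64_t SumInChunks(const std::vector<int64_t>& values, size_t chunk_size) {
  if (chunk_size == 0) chunk_size = 1;
  SumState total;
  for (size_t begin = 0; begin < values.size(); begin += chunk_size) {
    const size_t length = std::min(chunk_size, values.size() - begin);
    SumState partial;
    partial.Consume(values.data() + begin, length);
    total.Merge(partial);
  }
  return total.Finalize();
}
```

The point of the design is that group and non-group modes share the same three steps, so a scheduler would not need to know which mode it is driving.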
> Aggregations ideally should also support filter pushdown. For example:
> {code}
> select $AGG($EXPR)
> from $TABLE
> where $PREDICATE
> {code}
> Some systems might materialize the filtered (post-predicate) version of {{$EXPR}} and then aggregate that; pandas, for example, does this. Vectorized performance can be much improved by filtering inside the aggregation kernel instead. How the predicate's true/false values are handled may depend on the implementation details of the kernel (e.g. SUM or MEAN will be a bit different from PRODUCT).
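To make the contrast concrete, here is a rough sketch of materialize-then-aggregate versus filtering inside the kernel. The function names and the byte-per-value `mask` representation (1 where the predicate holds, 0 otherwise) are assumptions made for illustration, not the Arrow API. The sketch also shows one plausible reading of the SUM versus PRODUCT remark: a rejected value can be multiplied by 0 in a sum, but must be replaced by 1 in a product, and a mean additionally needs the count of selected values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers; not the Arrow API. mask[i] must be exactly 0 or 1,
// and mask must be at least as long as values.

// Materialize-then-aggregate: copies the selected values, then sums the copy.
inline int64_t SumMaterialized(const std::vector<int64_t>& values,
                               const std::vector<uint8_t>& mask) {
  std::vector<int64_t> filtered;
  for (size_t i = 0; i < values.size(); ++i) {
    if (mask[i]) filtered.push_back(values[i]);
  }
  int64_t sum = 0;
  for (int64_t v : filtered) sum += v;
  return sum;
}

// Filter inside the kernel: no intermediate copy. A rejected value is
// multiplied by 0, the additive identity, so it does not change the sum.
inline int64_t SumFiltered(const std::vector<int64_t>& values,
                           const std::vector<uint8_t>& mask) {
  int64_t sum = 0;
  for (size_t i = 0; i < values.size(); ++i) {
    sum += values[i] * static_cast<int64_t>(mask[i]);
  }
  return sum;
}

// MEAN also needs the number of selected values, not just their sum.
struct SumAndCount {
  int64_t sum = 0;
  int64_t count = 0;
};

inline SumAndCount SumAndCountFiltered(const std::vector<int64_t>& values,
                                       const std::vector<uint8_t>& mask) {
  SumAndCount out;
  for (size_t i = 0; i < values.size(); ++i) {
    out.sum += values[i] * static_cast<int64_t>(mask[i]);
    out.count += mask[i];
  }
  return out;
}

// PRODUCT cannot multiply by the mask: a rejected value has to become 1,
// the multiplicative identity, rather than 0.
inline int64_t ProductFiltered(const std::vector<int64_t>& values,
                               const std::vector<uint8_t>& mask) {
  int64_t product = 1;
  for (size_t i = 0; i < values.size(); ++i) {
    product *= mask[i] ? values[i] : 1;
  }
  return product;
}
```

Both sum variants return the same result; the in-kernel version simply avoids allocating and writing the filtered copy.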