You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/30 22:37:17 UTC

[GitHub] [arrow] westonpace commented on pull request #13130: ARROW-15591: [C++] Add support for aggregation to the Substrait consumer

westonpace commented on PR #13130:
URL: https://github.com/apache/arrow/pull/13130#issuecomment-1171742125

   So I studied this a bit more and I also had a very poor understanding of the aggregate rel in the past.  Substrait's aggregate node is considerably more flexible than Acero's current implementation.  There will be some limitations.
   
    * Substrait's aggregate node supports "grouping sets" which is the ability to group by multiple sets of keys at the same time.  For example: https://www.db-fiddle.com/f/4QQnTGkyENZtxBKrWtimAT/0  Acero does not have support for grouping sets.
      * If `AggregateRel::groupings` has more than 1 element then we should reject the plan.
      * If someone needed to polyfill this behavior they could do so with a `UNION ALL` I believe (although I don't think we yet support the union node)
    * Substrait's grouping keys can be arbitrary expressions.  Acero's grouping keys are `FieldRef`.
      * All instances of `AggregateRel::groupings[0]::grouping_expressions` must be direct references or else we reject the plan.
      * If someone needed to polyfill this behavior they could do so by first running a projection so that the expressions become fields to reference.
    * Substrait's measures support filtering.  Acero does not support this.
      * If any `AggregateRel::measures::filter` is specified then we should reject the plan.
      * I do not believe this can be fixed with a straightforward polyfill, we should add a JIRA.
    * Substrait's measures are potentially non-unary functions
      * If any `AggregateRel::measures::measure::arguments` has size != 1 then we should reject the plan.
      * I do not believe this can be fixed with a straightforward polyfill, however, there are not yet any standard aggregate functions defined which are non-unary.
    * Substrait's measures take in arguments as expressions
      * If any `AggregateRel::measures::measure::arguments` is not a direct reference then we should reject the plan.
      * If someone needed to polyfill this behavior they could do so by first running a projection so that the expressions become fields to reference.
   
   So, the mapping between Acero's AggregateNodeOptions and Substrait's AggregateRel should be:
   
   *  `AggregateNodeOptions::keys[N] == AggregateRel::groupings[0]::grouping_expressions[N]::selection::struct_field::field`
   * `AggregateNodeOptions::aggregates[N]::function == AggregateRel::measures[N]::measure::function_reference`
   * `AggregateNodeOptions::aggregates[N]::options == IGNORE FOR NOW`
   * `AggregateNodeOptions::aggregates[N]::target == AggregateRel::measures[N]::measure::arguments[0]::selection::struct_field::field`
   * `AggregateNodeOptions::aggregates[N]::name == ALWAYS EMPTY`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org