Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/30 00:42:09 UTC

[GitHub] [arrow-datafusion] yjshen commented on pull request #2375: WIP: Use row format for aggregate

yjshen commented on PR #2375:
URL: https://github.com/apache/arrow-datafusion/pull/2375#issuecomment-1113877450

   Sorry for mixing two things into one PR. I will split this into separate PRs, one for each of these ideas:
   
   1. Promote `physical-plan/hash_aggregates.rs` to a directory and rename it to `aggregates`. We already have a hash-based implementation, `GroupedHashAggregateStream`, for aggregates with grouping keys, and a non-hash implementation for aggregates without grouping keys. The latter keeps a single record of state but is named `HashAggregateStream`, even though it has nothing to do with hashing.
   
   - We could further extend the aggregation method to switch from hash-based to sort-based at runtime when we run out of memory, as described in https://github.com/apache/arrow-datafusion/issues/1570
   
   2. Use the row format to store grouping keys and accumulator states when all accumulator states are fixed-size. Use `Vec<ScalarValue>` in all other cases (when at least one accumulator state is variable-length, or when any of the `AggregateExpr`s doesn't support a row-based accumulator yet).
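
   To make the second idea concrete, here is a rough sketch of what a fixed-width row for accumulator state could look like. It is not the code in this PR: `FieldType`, `RowLayout`, and `accumulate` are hypothetical names made up for illustration, and the layout (little-endian, 8 bytes per field, no null bits) is an assumption.

```rust
/// Hypothetical fixed-width field types (illustration only, not DataFusion's types).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FieldType {
    Int64,
    Float64,
}

impl FieldType {
    /// Width in bytes of one value of this type.
    fn width(self) -> usize {
        match self {
            FieldType::Int64 | FieldType::Float64 => 8,
        }
    }
}

/// Byte offset of every field inside one packed row.
pub struct RowLayout {
    offsets: Vec<usize>,
    row_width: usize,
}

impl RowLayout {
    /// Assign consecutive offsets to the fields, in declaration order.
    pub fn new(fields: &[FieldType]) -> Self {
        let mut offsets = Vec::with_capacity(fields.len());
        let mut row_width = 0;
        for f in fields {
            offsets.push(row_width);
            row_width += f.width();
        }
        RowLayout { offsets, row_width }
    }

    pub fn row_width(&self) -> usize {
        self.row_width
    }

    /// Allocate a zero-initialized row.
    pub fn new_row(&self) -> Vec<u8> {
        vec![0u8; self.row_width]
    }

    /// Copy the 8 bytes of field `idx` out of the row.
    fn read8(&self, row: &[u8], idx: usize) -> [u8; 8] {
        let o = self.offsets[idx];
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&row[o..o + 8]);
        buf
    }

    /// Overwrite the 8 bytes of field `idx` in the row.
    fn write8(&self, row: &mut [u8], idx: usize, bytes: [u8; 8]) {
        let o = self.offsets[idx];
        row[o..o + 8].copy_from_slice(&bytes);
    }

    pub fn get_i64(&self, row: &[u8], idx: usize) -> i64 {
        i64::from_le_bytes(self.read8(row, idx))
    }

    pub fn set_i64(&self, row: &mut [u8], idx: usize, v: i64) {
        self.write8(row, idx, v.to_le_bytes());
    }

    pub fn get_f64(&self, row: &[u8], idx: usize) -> f64 {
        f64::from_le_bytes(self.read8(row, idx))
    }

    pub fn set_f64(&self, row: &mut [u8], idx: usize, v: f64) {
        self.write8(row, idx, v.to_le_bytes());
    }
}

/// Update a row laid out as [sum: Float64, count: Int64] in place,
/// the way a row-based SUM/COUNT pair of accumulators might.
pub fn accumulate(layout: &RowLayout, row: &mut [u8], value: f64) {
    let sum = layout.get_f64(row, 0);
    layout.set_f64(row, 0, sum + value);
    let count = layout.get_i64(row, 1);
    layout.set_i64(row, 1, count + 1);
}
```

   Because every field has a known offset, the state for one group is a single contiguous byte buffer that can be updated in place, instead of one `ScalarValue` per accumulator state. Variable-length states would not fit this scheme as-is, which is why they would fall back to `Vec<ScalarValue>`.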
   
   >  Maybe we can have a config setting for which one to use
   
   I think the choice between row-based accumulator states and `Vec<ScalarValue>`-based accumulator states should depend on row-based accumulator capability at query execution time: we only use row-based aggregate states when all of the accumulators support them. (This applies if and only if we are sure that the row-based version will always outperform the `Vec<ScalarValue>` version whenever it is applicable, based on benchmark results of course.)
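
   As a minimal sketch of that capability check (again with hypothetical names: `AccumulatorInfo`, `StateMode`, and `choose_state_mode` do not exist in DataFusion), the decision could be made once per query from the accumulators involved, rather than from a config setting:

```rust
/// Hypothetical summary of what one aggregate expression can do.
pub struct AccumulatorInfo {
    pub name: &'static str,
    /// The expression provides a row-based accumulator.
    pub supports_row_format: bool,
    /// The accumulator state has a fixed size (no strings, lists, ...).
    pub fixed_size_state: bool,
}

/// How accumulator states are stored for one query.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateMode {
    /// Packed fixed-width rows.
    Row,
    /// One `Vec<ScalarValue>` per group (the fallback).
    ScalarValues,
}

/// Pick row-based states only when every accumulator qualifies;
/// a single unsupported or variable-length accumulator forces the fallback.
/// Note: an empty slice yields `Row`, since `all` is true on no elements.
pub fn choose_state_mode(accumulators: &[AccumulatorInfo]) -> StateMode {
    let all_row_capable = accumulators
        .iter()
        .all(|a| a.supports_row_format && a.fixed_size_state);
    if all_row_capable {
        StateMode::Row
    } else {
        StateMode::ScalarValues
    }
}
```

   With this shape, a query with only `SUM` and `COUNT` over numeric columns would take the row path, while adding one aggregate with a variable-length state would move the whole query back to `Vec<ScalarValue>`.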
   

