You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "mingmwang (via GitHub)" <gi...@apache.org> on 2023/06/27 04:56:02 UTC

[GitHub] [arrow-datafusion] mingmwang commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

mingmwang commented on issue #4973:
URL: https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1608796360

@alamb @tustvold

If I understand correctly, in this new proposal, we will not use the `Row Layout` to manage the `Accumulator` states, right?
I think the `Row Layout` is still useful in the case of high cardinality grouping along with multiple `Accumulators` (avg, min, max,..). With the `Row Layout`, the states of multiple `Accumulators` belong to the same group can be updated in the tight loop. If we let each `Accumulator` manages its own state, I guess there will be more random memory access.

And below is my observation:

1). The per-group type dispatching overhead is not the major bottleneck, that's also why the PR #6657 only get 10%~ 15%
performance gain.
2). The random memory access and mem stall should be the major bottleneck in the case of high cardinality grouping .

And I checked DuckDB's source code, looks like they also use the `Row Layout` to manage the aggregator states.
They use C++ templates in the state update functions also. I'm going to take a closer look at the source code this week and next week.

https://github.com/duckdb/duckdb/blob/e49cf8132dc07324e0bb165593c8160837fcef3a/src/include/duckdb/common/types/row/row_layout.hpp

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org