Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/01/28 03:05:00 UTC

[jira] [Resolved] (ARROW-11300) [Rust][DataFusion] Improve hash aggregate performance with large number of groups in

     [ https://issues.apache.org/jira/browse/ARROW-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão resolved ARROW-11300.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 9271
[https://github.com/apache/arrow/pull/9271]

> [Rust][DataFusion] Improve hash aggregate performance with large number of groups in 
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-11300
>                 URL: https://issues.apache.org/jira/browse/ARROW-11300
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Daniël Heres
>            Assignee: Daniël Heres
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: image-2021-01-18-13-00-36-685.png
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Currently, hash aggregates perform well with a small number of output groups, but db-benchmark [https://github.com/h2oai/db-benchmark/pull/182] also tests on data with a high number of output groups.
> [https://github.com/apache/arrow/pull/9234] improved the situation a bit, but DataFusion is still much slower than even the slowest of the published results.
> This seems to be mostly due to the way we handle individual keys/groups.
> For each new key, we _take_ the indices of the group. This results in many small allocations, cache unfriendliness, and other overhead when a batch has many keys with only a few (just 1-2) rows per group. The indices are also converted from a Vec to an Array, which makes the situation worse (it accounts for ~22% of the instructions on the master branch!). The other profiling results seem to come from related allocations too.
> To make it efficient for tiny groups, we should probably change the hash aggregate algorithm to _take_ based on _all_ indices from the batch in one go, and "slice" into the resulting array for the individual accumulators.
>  
> Here is some profiling info of the db-benchmark questions 1-5 against master:
> !image-2021-01-18-13-00-36-685.png!
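
The proposal in the description (one _take_ over all indices of the batch, then "slicing" per group) can be illustrated with a minimal sketch. This is not the code from pull request 9271: it uses plain Vecs instead of Arrow arrays, and the names take and sum_by_key are hypothetical helpers invented for the illustration, with a sum standing in for an arbitrary accumulator.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a `take` kernel: gathers `values[i]` for each
/// `i` in `indices` into one newly allocated Vec.
fn take(values: &[i64], indices: &[usize]) -> Vec<i64> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Sums `values` per key using a single batched take and per-group slices,
/// instead of one take (and one allocation) per group.
fn sum_by_key(keys: &[u32], values: &[i64]) -> HashMap<u32, i64> {
    assert_eq!(keys.len(), values.len());

    // 1. Give each distinct key a dense group id and count rows per group.
    let mut group_of: HashMap<u32, usize> = HashMap::new();
    let mut group_keys: Vec<u32> = Vec::new();
    let mut counts: Vec<usize> = Vec::new();
    let mut row_group: Vec<usize> = Vec::with_capacity(keys.len());
    for &k in keys {
        let next = group_keys.len();
        let g = *group_of.entry(k).or_insert(next);
        if g == next {
            // First time this key is seen in the batch.
            group_keys.push(k);
            counts.push(0);
        }
        counts[g] += 1;
        row_group.push(g);
    }

    // 2. Prefix sums turn the counts into a [start, end) range per group.
    let mut offsets = vec![0usize; counts.len() + 1];
    for g in 0..counts.len() {
        offsets[g + 1] = offsets[g] + counts[g];
    }

    // 3. Scatter row indices so that each group's rows are contiguous.
    let mut cursor = offsets.clone();
    let mut indices = vec![0usize; keys.len()];
    for (row, &g) in row_group.iter().enumerate() {
        indices[cursor[g]] = row;
        cursor[g] += 1;
    }

    // 4. One take for the whole batch.
    let taken = take(values, &indices);

    // 5. Each accumulator only sees a slice of the taken values.
    group_keys
        .iter()
        .enumerate()
        .map(|(g, &k)| {
            let total: i64 = taken[offsets[g]..offsets[g + 1]].iter().sum();
            (k, total)
        })
        .collect()
}
```

For keys [1, 2, 1, 3, 2] and values [10, 20, 30, 40, 50], sum_by_key returns 40 for key 1, 70 for key 2, and 40 for key 3. The number of allocations in this sketch depends on the batch size rather than on the number of groups, which is the property the description is after for batches with many tiny groups.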



--
This message was sent by Atlassian Jira
(v8.3.4#803005)