Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/29 17:29:49 UTC

[GitHub] [arrow-datafusion] ic4y opened a new issue #1504: The destruction of GroupState in high cardinality aggregation takes a lot of time

ic4y opened a new issue #1504:
URL: https://github.com/apache/arrow-datafusion/issues/1504


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   The test setup is as follows (4 cores, 16 GB RAM, macOS):
   ```select count(1) from (select user_id from event group by user_id)a ```
   
   The table has 350 million rows in total, with 5 million distinct `user_id` values. The entire query takes 15s. Profiling with pprof shows that **about 60% of the time is spent destructing the GroupState**.
   
   ![image](https://user-images.githubusercontent.com/83933160/147687920-9257e90d-d611-4fee-b74b-d724d5bfcfbd.png)
   
   
   **Describe the solution you'd like**
   
   Use [bumpalo](https://github.com/fitzgen/bumpalo) to allocate GroupState values in a few large chunks of memory where possible, and then release each chunk as a whole. Would the destruction time be much better this way?
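
To make the arena idea concrete, here is a minimal std-only sketch. It is not bumpalo and not DataFusion code: `ByteArena`, `Handle`, and the 4096-byte chunk size are hypothetical names and values chosen for illustration. Group keys are copied into a few large chunks, so dropping the arena frees one buffer per chunk instead of one small allocation per group.

```rust
/// Hypothetical handle to a byte range stored inside the arena.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Handle {
    chunk: usize,
    start: usize,
    len: usize,
}

/// Hypothetical bump-style arena: values are appended to large chunks
/// and are only released together, when the whole arena is dropped.
struct ByteArena {
    chunks: Vec<Vec<u8>>,
    chunk_size: usize,
    current_limit: usize, // byte budget of the most recent chunk
}

impl ByteArena {
    fn new(chunk_size: usize) -> Self {
        ByteArena { chunks: Vec::new(), chunk_size, current_limit: 0 }
    }

    /// Copy `bytes` into the current chunk, opening a new chunk when the
    /// current one has no room left.
    fn alloc(&mut self, bytes: &[u8]) -> Handle {
        let remaining = match self.chunks.last() {
            Some(chunk) => self.current_limit - chunk.len(),
            None => 0,
        };
        if self.chunks.is_empty() || remaining < bytes.len() {
            // Oversized values get a chunk of their own size.
            self.current_limit = self.chunk_size.max(bytes.len());
            self.chunks.push(Vec::with_capacity(self.current_limit));
        }
        let chunk = self.chunks.len() - 1;
        let buf = &mut self.chunks[chunk];
        let start = buf.len();
        buf.extend_from_slice(bytes);
        Handle { chunk, start, len: bytes.len() }
    }

    /// Read back the bytes behind a handle.
    fn get(&self, h: Handle) -> &[u8] {
        &self.chunks[h.chunk][h.start..h.start + h.len]
    }

    /// Number of large buffers that will be freed when the arena is dropped.
    fn chunk_count(&self) -> usize {
        self.chunks.len()
    }
}

fn main() {
    let mut arena = ByteArena::new(4096);
    // Store 10,000 eight-byte keys, standing in for per-group state.
    let handles: Vec<Handle> = (0u64..10_000)
        .map(|id| arena.alloc(&id.to_le_bytes()))
        .collect();
    println!("keys stored: {}", handles.len());
    println!("buffers to free on drop: {}", arena.chunk_count());
    println!("key 42 bytes: {:?}", arena.get(handles[42]));
}
```

Whether this pattern removes the drop cost seen in the profile is an open question; it only shows why releasing memory chunk by chunk involves far fewer deallocations than dropping each group separately.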
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] ic4y closed issue #1504: The destruction of GroupState in high cardinality aggregation takes a lot of time

Posted by GitBox <gi...@apache.org>.
ic4y closed issue #1504:
URL: https://github.com/apache/arrow-datafusion/issues/1504


   





[GitHub] [arrow-datafusion] ic4y commented on issue #1504: The destruction of GroupState in high cardinality aggregation takes a lot of time

Posted by GitBox <gi...@apache.org>.
ic4y commented on issue #1504:
URL: https://github.com/apache/arrow-datafusion/issues/1504#issuecomment-1006710778


   > By using --features "mimalloc", it was found that the test results did not differ much.
   
   Using `--features "mimalloc"` alone did not take effect.
   
   Adding the following code to `main.rs` (as mentioned in the [user guide](https://github.com/apache/arrow-datafusion/blob/ecb09d9e37a4ea8f06d145c4fdcbdb3b8bb64ab7/docs/source/user-guide/library.md)) solves this problem:
   
   ```rust
   #[global_allocator]
   static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
   ```
   
   





[GitHub] [arrow-datafusion] rdettai commented on issue #1504: The destruction of GroupState in high cardinality aggregation takes a lot of time

Posted by GitBox <gi...@apache.org>.
rdettai commented on issue #1504:
URL: https://github.com/apache/arrow-datafusion/issues/1504#issuecomment-1004244530


   Thanks for the analysis, @ic4y! I am quite surprised that the fragmentation caused by the row-oriented structure of `Accumulators` costs us so much more at deallocation time than when computing the actual aggregates.
   
   I guess the arena strategy that you are suggesting should work, though I don't know bumpalo specifically. 
   
   It is worth referencing your discussion about making accumulators column based in #956, which I believe should also solve this issue.
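
As a rough illustration of what "column based" state could look like, here is a minimal std-only sketch. It is an assumption made for illustration, not the design discussed in #956 and not DataFusion's actual data structures: `ColumnarGroups` and its fields are hypothetical names. Each group is a slot index into shared vectors, so there is no per-group heap object to destroy.

```rust
use std::collections::HashMap;

/// Hypothetical column-oriented group state for `GROUP BY user_id`.
/// Per-group values live in shared vectors indexed by slot.
struct ColumnarGroups {
    slot_of: HashMap<u64, usize>, // user_id -> slot index
    keys: Vec<u64>,               // slot -> user_id
    row_counts: Vec<u64>,         // slot -> number of rows seen
}

impl ColumnarGroups {
    fn new() -> Self {
        ColumnarGroups {
            slot_of: HashMap::new(),
            keys: Vec::new(),
            row_counts: Vec::new(),
        }
    }

    /// Fold one input row into the state of its group.
    fn update(&mut self, user_id: u64) {
        let next_slot = self.keys.len();
        let slot = *self.slot_of.entry(user_id).or_insert(next_slot);
        if slot == next_slot {
            // First row for this key: append a new slot to every column.
            self.keys.push(user_id);
            self.row_counts.push(0);
        }
        self.row_counts[slot] += 1;
    }

    /// Number of distinct groups, i.e. the result of the outer `count(1)`.
    fn num_groups(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut groups = ColumnarGroups::new();
    // 20 rows spread over 5 distinct user ids.
    for row in 0u64..20 {
        groups.update(row % 5);
    }
    println!("distinct user_id values: {}", groups.num_groups());
    println!("rows per group: {:?}", groups.row_counts);
}
```

Dropping this structure frees the hash table and two vectors, independent of the number of groups, which is the property that would sidestep the per-group destruction cost.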





[GitHub] [arrow-datafusion] ic4y commented on issue #1504: The destruction of GroupState in high cardinality aggregation takes a lot of time

Posted by GitBox <gi...@apache.org>.
ic4y commented on issue #1504:
URL: https://github.com/apache/arrow-datafusion/issues/1504#issuecomment-1002915299


   By using --features "mimalloc", it was found that the test results did not differ much.

