Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/04 16:12:21 UTC

[GitHub] [arrow-datafusion] ic4y commented on pull request #1520: use bumpalo for GroupState

ic4y commented on pull request #1520:
URL: https://github.com/apache/arrow-datafusion/pull/1520#issuecomment-1004941980


   From
   ```rust
   struct Accumulators {
   
       map: RawTable<(u64, usize)>,
   
       group_states: Vec<GroupState>,
   }
   ```
   To
   ```rust
   struct Accumulators {
   
       map: RawTable<(u64, usize)>,
   
       group_states: BumpVec<GroupState>,
   }
   ```
   
   Allocating `group_states` with bumpalo greatly reduces the time spent destructing it when group cardinality is high: dropping `group_states` then barely registers in the pprof profile.
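   
   As a rough illustration of why this helps, here is a minimal std-only sketch of the general arena idea. It is not the code in this PR and does not use bumpalo's API; `OwnedState`, `ArenaState`, `Arena` and `owned_drop_calls` are hypothetical names made up for the example. When every group owns its own heap buffer, teardown runs one destructor per group; when the data lives in shared buffers, teardown frees a fixed number of buffers.
   
   ```rust
   use std::cell::Cell;
   use std::rc::Rc;
   
   // Hypothetical stand-in for a group state that owns its own heap buffer.
   // Dropping a Vec of these runs one destructor (and one free) per group.
   struct OwnedState {
       #[allow(dead_code)]
       key: Vec<u8>,
       drops: Rc<Cell<usize>>, // shared counter, for illustration only
   }
   
   impl Drop for OwnedState {
       fn drop(&mut self) {
           self.drops.set(self.drops.get() + 1);
       }
   }
   
   // Arena-style variant: all keys live in one shared byte buffer and each
   // state is a plain (offset, len) pair with no destructor of its own.
   #[derive(Clone, Copy)]
   struct ArenaState {
       offset: usize,
       len: usize,
   }
   
   #[derive(Default)]
   struct Arena {
       bytes: Vec<u8>,
       states: Vec<ArenaState>,
   }
   
   impl Arena {
       // Copies `key` into the shared buffer and returns the new group index.
       fn push(&mut self, key: &[u8]) -> usize {
           let offset = self.bytes.len();
           self.bytes.extend_from_slice(key);
           self.states.push(ArenaState { offset, len: key.len() });
           self.states.len() - 1
       }
   
       // Returns the key bytes stored for group `idx`.
       fn key(&self, idx: usize) -> &[u8] {
           let s = self.states[idx];
           &self.bytes[s.offset..s.offset + s.len]
       }
   }
   
   // Builds `n` owned states, drops them, and returns how many destructors ran.
   fn owned_drop_calls(n: u64) -> usize {
       let drops = Rc::new(Cell::new(0));
       let states: Vec<OwnedState> = (0..n)
           .map(|i| OwnedState { key: i.to_le_bytes().to_vec(), drops: Rc::clone(&drops) })
           .collect();
       drop(states);
       drops.get()
   }
   
   fn main() {
       println!("per-group destructors run: {}", owned_drop_calls(1_000));
   
       let mut arena = Arena::default();
       for i in 0..1_000u64 {
           arena.push(&i.to_le_bytes());
       }
       println!("key of group 7: {:?}", arena.key(7));
       // Dropping the arena frees two buffers, however many groups it holds.
       drop(arena);
   }
   ```
   
   With 50 million groups the first layout pays for 50 million destructor calls at teardown, which is consistent with `drop_in_place<GroupState>` dominating the profile on master.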
   
   
   The test data set has 350 million rows in total, with 50 million distinct `user_id` values.
   SQL: `select count(1) from (select user_id from event group by user_id) a`
   
   **master:**
   `drop_in_place<GroupState>` takes 6s (50%); total 14s
   	
   ![image](https://user-images.githubusercontent.com/83933160/148085458-434bf55e-f6d4-45d7-8c59-e12cb4479a7b.png)
   
   **bumpalo:**
   `drop_in_place<GroupState>` takes 0s (not counted); total 8s (about 40% less time than master)
   ![image](https://user-images.githubusercontent.com/83933160/148085572-441104d5-0b90-4959-9da3-7c37d5e0efdd.png)
   
   On the TPC-H benchmark there is almost no difference. I think the reason is that the group cardinality there is not high enough.
   ![image](https://user-images.githubusercontent.com/83933160/148085917-c4439fd5-2fad-486e-a8b8-a09d3beb98c8.png)
   
   
   
   
   
   

