You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/16 21:23:07 UTC
[GitHub] [arrow-datafusion] alamb commented on issue #1456: The Eq method in HashAggregate takes up a lot of time, how to optimize it

alamb commented on issue #1456:
URL: https://github.com/apache/arrow-datafusion/issues/1456#issuecomment-996204065


   I think `eq_array` is a symptom, rather than the root cause
   
   The eq_array is necessary in the current hash aggregate  implementation to detect hash collisions:
   
   ```
                                               ┌──────────────────────────────────────┐
                   ┌─────────┐                 │Bucket                                │
                   │         │                 │(                                     │
                   │HashTable│        ┌───────▶│ grp_key1: ScalarValue                │
                   │         │────────┘        │ grp_key2: ScalarValue                │
                   │         │                 │)                                     │
                   │         │                 └──────────────────────────────────────┘
                   └─────────┘                                                         
                                                                                       
                                                                                       
                                                                                       
                                                                                       
     ┌─────────────┬─────────────┐                                                     
     │Group Column │Group Column │                                                     
     │      A      │      B      │             Step 1:  hash(grp_key1, grp_key2) is    
     └─────────────┴─────────────┘                     computed (vectorized)           
           ...           ...                                                           
     ┌─────────────┬─────────────┐             Step 2: bucket for that hash value is   
     │  grp_key1   │  grp_key2   │                           obtained                  
     └─────────────┴─────────────┘                                                     
           ...           ...                Step 3: Validate that the values stored in 
                                             the bucket are the same as the input key  
                                              (aka that there are no hash collisions)  
                                                                                       
                                                                                       
                                                                                       
                                                                                       
                                                                                       
           eq_array is used for step 3                                                 
                                                                                       
   ```
   
   I am pretty sure this code is correct, though since it is general purpose (works for all types) there is non trivial dispatch overhead
   
   If you are trying to speed up a distinct aggregate calculation I suggest you look into special casing group keys which are native types and which can be packed into fixed length byte arrays (so they can be compared using mem comparisons rather than dispatching on each column) 
   
   Another way of saying this is "don't try and remove `eq_array` but instead try to remove the use of `Scalar` entirely
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org