Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/28 15:18:29 UTC

[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #418: [question] performance considerations of create_key_for_col (HashAggregate)

Dandandan edited a comment on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-850491025


   @ravlio 
   
   You are very right, that part is suboptimal.
   There are also a couple of other things in the hash aggregates:
   
   * Keys are created per row and indexed by `Vec<u8>`. Each `Vec<u8>` value will be hashed, but it is also rehashed as the hash table grows (Rust's `HashMap` doesn't remember hash values by default, so it has to compute them again every time the capacity of the table is exhausted).
   We should do something similar (vectorized hashing) in the hash aggregates as we do in the hash join. There the `create_hashes` function generates a `u64` hash value for each row in one simple loop, without downcasting / converting per row (see the sketch at the end of this comment). The challenging part of this is detecting hash collisions and making sure that check is fast too.
   
   * Hash aggregates with small groups generate a lot of small allocations, dynamically typed scalar values, and trait objects for the aggregate states. Also, in some parts intermediate `Vec`s are allocated, which show up in profiles as well.
   I've addressed some performance issues in different parts earlier, however there are still large gains possible, mainly for some more challenging queries. I think what would be best in the long run is building a mutable, typed array for the aggregation states, and keeping only the *offsets* into that array in a hash table structure. This would allow updating the states in a vectorized way, like this (pseudocode):
   
   ```rust
   // sum: add each row's value to its group's state
   for (&offset, &val) in batch_offsets.iter().zip(batch_vals.iter()) {
      states[offset] += val;
   }
   // count: increment the state of each row's group
   for &offset in batch_offsets.iter() {
      states[offset] += 1;
   }
   // avg: accumulate a running sum and row count per group
   for (&offset, &val) in batch_offsets.iter().zip(batch_vals.iter()) {
      state_avg_sum[offset] += val;
      state_avg_rows[offset] += 1;
   }
   ```
   
   After the hash aggregation, the arrays don't need any further processing, as the `offset` values are already increasing for each row. Only the avg here needs to run a `divide` at the end, which is fast as we can use the Arrow kernel directly.
   (Currently the group by values are converted to arrays again, which takes additional time.)
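   
   To make the vectorized hashing idea from the first point above more concrete, here is a minimal sketch. It assumes a single `i64` key column already materialized as a slice and uses the standard library's `DefaultHasher`; the names `hash_rows` and `key_values` are hypothetical, and the real `create_hashes` in the hash join works on Arrow arrays and combines multiple key columns.
   
   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};
   
   // Hypothetical sketch: compute one u64 hash per row in a single pass over
   // the column, instead of building a Vec<u8> key per row and hashing that.
   fn hash_rows(key_values: &[i64]) -> Vec<u64> {
      key_values
         .iter()
         .map(|v| {
            let mut hasher = DefaultHasher::new();
            v.hash(&mut hasher);
            hasher.finish()
         })
         .collect()
   }
   ```
   
   Keeping these `u64` values next to the group states would also avoid the rehashing on table growth mentioned above, since the stored hash can be reused instead of recomputed.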

