Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/24 19:13:55 UTC

[GitHub] [arrow-datafusion] ravlio opened a new issue #418: performance considerations of create_key_for_col (HashAggregate)

ravlio opened a new issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418


   Hello! I'm just learning Rust, so maybe I'm wrong - please correct me if so. I noticed that this key-creation code https://github.com/apache/arrow-datafusion/blob/174226c086a4838eab2a238853b4871c295c0189/datafusion/src/physical_plan/hash_aggregate.rs#L514 is called on each row, and that it matches and downcasts on every row for every grouping key. All of this should add a lot of extra CPU instructions (I haven't tested it, just my thoughts). Isn't this a suboptimal approach? For instance, we could leverage generics or a macro here. Thanks.
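
   For example (a hypothetical sketch, not the current DataFusion code), the match/downcast could be hoisted out of the per-row loop so that it happens once per column, with the inner loop only appending key bytes - only two types shown, nulls ignored:

   ```rust
   use arrow::array::{Array, ArrayRef, Int32Array, UInt64Array};
   use arrow::datatypes::DataType;

   // Hypothetical sketch: match and downcast once per column, then run a tight
   // per-row loop that only appends that column's key bytes.
   fn append_key_bytes(col: &ArrayRef, keys: &mut [Vec<u8>]) {
       match col.data_type() {
           DataType::Int32 => {
               let col = col.as_any().downcast_ref::<Int32Array>().unwrap();
               for (row, key) in keys.iter_mut().enumerate() {
                   key.extend_from_slice(&col.value(row).to_le_bytes());
               }
           }
           DataType::UInt64 => {
               let col = col.as_any().downcast_ref::<UInt64Array>().unwrap();
               for (row, key) in keys.iter_mut().enumerate() {
                   key.extend_from_slice(&col.value(row).to_le_bytes());
               }
           }
           _ => unimplemented!("other types omitted from this sketch"),
       }
   }
   ```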


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #418: [question] performance considerations of create_key_for_col (HashAggregate)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-850491025


   @ravlio 
   
   You are very right, that part is suboptimal.
   There are also a couple of other issues in the hash aggregates:
   
   * Keys are created per row and indexed by `Vec<u8>`. Each `Vec<u8>` value is hashed, but it is also rehashed as the hash table grows (Rust's HashMap doesn't remember hash values by default, so it has to recompute them every time the capacity of the table is exhausted).
   We should do something similar (vectorized hashing) in the hash aggregates to what we do in the hash join. There the `create_hashes` function generates `u64` hash values for each row in one simple loop, without downcasting / converting per row. The challenging part of this is detecting hash collisions and making sure that is fast too (see the sketch at the end of this comment).
   
   * Hash aggregates with small groups generate a lot of small allocations, dynamically typed scalar values, and trait objects for the aggregate states. In some parts, intermediate `Vec`s are also allocated, which show up in profiles as well.
   I've addressed some performance issues in different parts earlier; however, there are still large gains possible, mainly for some of the more challenging queries. I think what would be best in the long run is building a mutable typed array for the aggregation states and keeping only the *offsets* into that array in a hash table structure. This would allow vectorized updating of the states like this (pseudocode):
   
   ```rust
   // sum: add each incoming value to its group's running sum
   for (&offset, &val) in batch_offsets.iter().zip(batch_vals.iter()) {
       states[offset] += val;
   }
   // count: bump each group's row count
   for &offset in batch_offsets.iter() {
       states[offset] += 1;
   }
   // avg: keep a running sum and a row count per group
   for (&offset, &val) in batch_offsets.iter().zip(batch_vals.iter()) {
       state_avg_sum[offset] += val;
       state_avg_rows[offset] += 1;
   }
   ```
   
   After the hash aggregation, the arrays don't need any further processing, as all of the `offset` values are already increasing for each row.
    (Currently the group-by values are converted to arrays again, which takes additional time.)
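
   A rough sketch of the vectorized-hashing idea from the first bullet (a hypothetical helper, not DataFusion's actual `create_hashes`; only non-null `u32` columns are handled and the hash-combining scheme is arbitrary):

   ```rust
   use arrow::array::{Array, ArrayRef, UInt32Array};

   // Downcast each group-by column once, then combine per-row hashes into one
   // `u64` per row in a tight loop - no per-row matching, downcasting or `Vec<u8>` keys.
   fn hash_u32_columns(columns: &[ArrayRef], num_rows: usize) -> Vec<u64> {
       let mut hashes = vec![0u64; num_rows];
       for col in columns {
           let col = col.as_any().downcast_ref::<UInt32Array>().expect("u32 column");
           for (row, hash) in hashes.iter_mut().enumerate() {
               *hash = hash.wrapping_mul(31).wrapping_add(col.value(row) as u64);
           }
       }
       hashes
   }
   ```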


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #418: [question] performance considerations of create_key_for_col (HashAggregate)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-851552960


   @jorgecarleitao 
   
   Interesting!
   I did some earlier experiments with vectorized hashing too (and saw similar speedups for low-cardinality aggregates), but got a bit stuck on making the hash collision detection performant enough not to slow things down.
   
   I am not sure yet where the hash collision detection happens in the experimental branch - what if two groups map to the same `u64` hash? It is quite hard to trigger that without writing some unit tests - I also got all the tests passing without implementing anything for collisions. A trivial way to detect collisions without restructuring the code much (comparing `GroupByValues` while inserting) made it perform slower than the original implementation.
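
   For illustration, collision handling could look roughly like this (a hypothetical sketch, not the experimental branch's code): map each `u64` hash to the group indices already seen with that hash, and compare the actual group-by values before reusing a group.

   ```rust
   use std::collections::HashMap;

   // `GroupKey` stands in for the actual group-by values of a row; the `u64` hash
   // is only a hint, so the stored key values must be compared before two rows
   // are treated as the same group.
   type GroupKey = Vec<u64>;

   fn group_index(
       hash: u64,
       key: &GroupKey,
       hash_to_groups: &mut HashMap<u64, Vec<usize>>, // hash -> candidate group indices
       group_keys: &mut Vec<GroupKey>,                // group index -> its key values
   ) -> usize {
       let candidates = hash_to_groups.entry(hash).or_default();
       // Two different keys can map to the same hash, so check the stored keys.
       if let Some(&idx) = candidates.iter().find(|&&idx| &group_keys[idx] == key) {
           return idx;
       }
       // New group: remember its key and register it under this hash.
       let idx = group_keys.len();
       group_keys.push(key.clone());
       candidates.push(idx);
       idx
   }
   ```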
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #418: [question] performance considerations of create_key_for_col (HashAggregate)

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-851522837


   fwiw, I have used the hashing [in the experimental branch](https://github.com/apache/arrow-datafusion/pull/68/files#diff-03876812a8bef4074e517600fdcf8e6b49f1ea24df44905d6d806836fd61b2a8R328) and I am getting some in-memory speedups:
   
   |                        |master (ms)|1b48f389 (ms)|
   |------------------------|----------:|------------:|
   |Q1 memory               |      332.9|        239.6|
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #418: HashAggregate performance improvements

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-1063170785


   FWIW I think @yjshen is working on this using #1849 and the row format frameworks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #418: [question] performance considerations of create_key_for_col (HashAggregate)

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-850491025


   @ravlio 
   
   You are very right, that part is suboptimal.
   There are also a couple of other issues in the hash aggregates:
   
   * Keys are created per row and indexed by `Vec<u8>`. Each `Vec<u8>` value is hashed, but it is also rehashed as the hash table grows (Rust's HashMap doesn't remember hash values by default, so it has to recompute them every time the capacity of the table is exhausted).
   We should do something similar (vectorized hashing) in the hash aggregates to what we do in the hash join. There the `create_hashes` function generates `u64` hash values for each row in one simple loop, without downcasting / converting per row. The challenging part of this is detecting hash collisions and making sure that is fast too.
   
   * Hash aggregates with small groups generate a lot of small allocations, dynamically typed scalar values, and trait objects for the aggregate states. In some parts, intermediate `Vec`s are also allocated, which show up in profiles as well.
   I've addressed some performance issues in different parts earlier; however, there are still large gains possible, mainly for some of the more challenging queries. I think what would be best in the long run is building a mutable typed array for the aggregation states and keeping only the *offsets* into that array in a hash table structure. This would allow vectorized updating of the states like this (pseudocode):
   
   ```rust
   // sum: add each incoming value to its group's running sum
   for (&offset, &val) in batch_offsets.iter().zip(batch_vals.iter()) {
       states[offset] += val;
   }
   // count: bump each group's row count
   for &offset in batch_offsets.iter() {
       states[offset] += 1;
   }
   // avg: keep a running sum and a row count per group
   for (&offset, &val) in batch_offsets.iter().zip(batch_vals.iter()) {
       state_avg_sum[offset] += val;
       state_avg_count[offset] += 1;
   }
   ```
   
   After the hash aggregation, the arrays don't need any further processing, as all of the `offset` values are already increasing for each row. Only the avg here needs to run a `divide` at the end, which is fast because we can use the arrow kernel directly.
    (Currently the group-by values are converted to arrays again, which takes additional time.)
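
   A minimal sketch of that final step, assuming arrow's arithmetic `divide` kernel and `Float64` state arrays (the `sums` / `counts` names are illustrative):

   ```rust
   use arrow::array::Float64Array;
   use arrow::compute::kernels::arithmetic::divide;
   use arrow::error::ArrowError;

   // One vectorized kernel call turns the avg state arrays into the final averages.
   fn finalize_avg(sums: &Float64Array, counts: &Float64Array) -> Result<Float64Array, ArrowError> {
       divide(sums, counts)
   }
   ```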


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #418: [question] performance considerations of create_key_for_col (HashAggregate)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-851518881


   Thanks for the input @jhorstmann - that adds some support for the idea! Something like a ~2x speedup for the more challenging queries where DF currently "struggles" (or bigger for some extreme queries, like 1 value per group) is about what I would expect from making the change in DataFusion as well, based on some profiling results, in particular for high-cardinality queries. It might be an idea to count the number of distinct values inside a batch (which should be doable, as we are already inserting the offsets into the HashMap) to decide whether to do a `take` + `batch_update` for low cardinality, or to combine it into one loop for high-cardinality updates?
   
   Arrow currently lacks some kind of mutable array that can mutate data at offsets (rather than only append values) - what are you using for that, or did you build your own structure?
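
   For what it's worth, a hypothetical sketch of such a structure (not an Arrow API): a plain typed `Vec` plus a validity vector that can grow for new groups and be mutated at arbitrary offsets.

   ```rust
   // Hypothetical mutable, typed state buffer: Arrow arrays are immutable once
   // built, so something like this could hold per-group aggregate states and be
   // updated in place at arbitrary offsets.
   struct MutableStates {
       values: Vec<f64>,
       valid: Vec<bool>,
   }

   impl MutableStates {
       fn new() -> Self {
           Self { values: Vec::new(), valid: Vec::new() }
       }

       /// Make sure a state slot exists for `group` (new slots start at 0 / invalid).
       fn reserve_group(&mut self, group: usize) {
           if group >= self.values.len() {
               self.values.resize(group + 1, 0.0);
               self.valid.resize(group + 1, false);
           }
       }

       /// Scatter-update: add `val` to the state at offset `group`.
       fn add(&mut self, group: usize, val: f64) {
           self.reserve_group(group);
           self.values[group] += val;
           self.valid[group] = true;
       }
   }
   ```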


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jhorstmann commented on issue #418: [question] performance considerations of create_key_for_col (HashAggregate)

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #418:
URL: https://github.com/apache/arrow-datafusion/issues/418#issuecomment-851422139


   > I think what would be best in the long run is building a mutable typed array for the aggregation states, and keeping only the _offsets_ into that array in a hash table structure.
   
   I agree, and I also did some experiments around this in the last weeks. The slowest part is unfortunately still that you need a hashmap to build this array of consecutive offsets, but that part is much easier to specialize for different key types or numbers of group-by columns. Updating the accumulator states as in your code then trades dynamic method calls for random memory accesses and should be faster.
   
   I did not base those experiments on arrow or datafusion, though; the approach in datafusion of collecting the row indices and using `take` + `batch_update` could be better for small cardinalities but might otherwise have bigger overhead. The two approaches I compared were:
   
    - a simple `HashMap<Vec<Scalar>, Vec<Box<dyn Accumulator>>>` that gets updated row by row and is afterwards iterated per group and aggregate column
    - a `HashMap<Vec<Scalar>, usize>`, the value being a sequential offset, while at the same time collecting the keys into a `Vec<Vec<Scalar>>` that corresponds to those offsets. Afterwards this does basically the same as in your example code, only requiring a dispatch on the array type once (see the sketch below).
   
   In a microbenchmark with random data and grouping by two u32 columns (about 1000 different groups), ignoring any nulls, the second one was about 2x faster.
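
   For concreteness, a hypothetical sketch of the second approach (two `u32` key columns flattened into a tuple key, a single `sum` aggregate, nulls ignored):

   ```rust
   use std::collections::HashMap;

   // Pass 1 maps each row's key to a sequential group offset (the hashed, slow part);
   // pass 2 updates plain typed state vectors by offset, dispatching on the type once.
   type Key = (u32, u32); // two u32 group-by columns

   fn group_sum(keys: &[Key], values: &[f64]) -> (Vec<Key>, Vec<f64>) {
       let mut offsets: HashMap<Key, usize> = HashMap::new();
       let mut group_keys: Vec<Key> = Vec::new(); // group keys in offset order
       let mut batch_offsets: Vec<usize> = Vec::with_capacity(keys.len());

       for key in keys {
           let next = group_keys.len();
           let offset = *offsets.entry(*key).or_insert_with(|| {
               group_keys.push(*key);
               next
           });
           batch_offsets.push(offset);
       }

       let mut sums = vec![0.0f64; group_keys.len()];
       for (&offset, &val) in batch_offsets.iter().zip(values.iter()) {
           sums[offset] += val;
       }
       (group_keys, sums)
   }
   ```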


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


