Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/24 22:24:38 UTC

[GitHub] [arrow] Dandandan opened a new pull request #8765: ARROW-10722: Reduce overhead in GroupByScalar, improve benchmarks

Dandandan opened a new pull request #8765:
URL: https://github.com/apache/arrow/pull/8765


   This PR reduces the size of `GroupByScalar` from 32 bytes to 24 bytes by using `Box<str>`.
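
   As a rough illustration of where the saving comes from (a minimal sketch, not the actual DataFusion definition): `String` is pointer + capacity + length, while `Box<str>` is just pointer + length, so the string variant's payload shrinks by 8 bytes.

   ```rust
   use std::mem::size_of;

   // Simplified stand-ins for the two layouts; the real GroupByScalar
   // has more integer variants, but the sizes work out the same way.
   #[allow(dead_code)]
   enum WithString {
       Int64(i64),
       Utf8(String),   // 24-byte payload: ptr + capacity + length
   }

   #[allow(dead_code)]
   enum WithBoxedStr {
       Int64(i64),
       Utf8(Box<str>), // 16-byte payload: ptr + length
   }

   fn main() {
       // On a typical 64-bit target this prints 32 and 24,
       // matching the numbers quoted above.
       println!("{}", size_of::<WithString>());
       println!("{}", size_of::<WithBoxedStr>());
   }
   ```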
   
   The difference in speed seems to be minimal, at least in the current state.
   
   In the future, it could be nice to see whether the data could be packed more efficiently by using the schema instead of creating "dynamic" values.
   
   This PR also tries to improve reproducibility in the benchmarks a bit by seeding the random number generator (results are still quite noisy on my machine, though).
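
   A minimal sketch of the seeding idea (the function name and seed value are illustrative, not the benchmark's actual code):

   ```rust
   use rand::{rngs::StdRng, Rng, SeedableRng};

   // With a fixed seed, every run generates the same input data, so
   // run-to-run differences come from the code under test, not the data.
   fn random_group_values(n: usize) -> Vec<i64> {
       let mut rng = StdRng::seed_from_u64(42); // illustrative seed
       (0..n).map(|_| rng.gen()).collect()
   }
   ```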


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks

jorgecarleitao closed pull request #8765:
URL: https://github.com/apache/arrow/pull/8765


   





[GitHub] [arrow] github-actions[bot] commented on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of GroupByScalar, improve benchmarks

github-actions[bot] commented on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-733273993


   https://issues.apache.org/jira/browse/ARROW-10722





[GitHub] [arrow] Dandandan edited a comment on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks

Dandandan edited a comment on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-733824217


   @jorgecarleitao Not really on performance, as the current benchmarks / queries show; I am just looking for ways to improve the aggregate / join performance.
   
   The main thing I wanted to investigate is whether the aggregates / joins themselves can be made faster. I think one part would be to create a key that can be hashed faster: currently the hashing algorithm hashes each individual GroupByValue instead of working on a byte array, and the latter is in principle faster. Some specialized code could also be written for hashing based on a single column only.
   
   The current changes have a larger impact on _**memory usage**_, though, if you are hashing / aggregating something with high cardinality, as each key generates extra bytes per row: 16 bytes for each GroupByValue, 8 bytes for using `Vec`, and 8 bytes for boxing the inner Vec of the aggregation. So in an extreme case where every key is unique, that is 32 extra bytes per row.
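
   To make the contrast concrete, here is a minimal sketch of the two hashing strategies using the standard library's `DefaultHasher` (DataFusion's actual hasher and key layout differ):

   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   // Hashing through the Hash trait dispatches once per value.
   fn hash_per_value(values: &[i64]) -> u64 {
       let mut h = DefaultHasher::new();
       for v in values {
           v.hash(&mut h); // one hasher call per value
       }
       h.finish()
   }

   // Packing the key into one contiguous buffer lets the hasher run
   // over a single byte slice, which is in principle faster.
   fn hash_byte_array(values: &[i64]) -> u64 {
       let mut buf = Vec::with_capacity(values.len() * 8);
       for v in values {
           buf.extend_from_slice(&v.to_le_bytes());
       }
       let mut h = DefaultHasher::new();
       h.write(&buf); // single call over the packed bytes
       h.finish()
   }
   ```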





[GitHub] [arrow] Dandandan commented on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of GroupByScalar, improve benchmarks

Dandandan commented on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-733642960


   Some further tweaks: this now uses `Box<[GroupByScalar]>` as the key, avoiding a further 8 bytes in the key.
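
   For reference, a minimal sketch of where those 8 bytes go: `Vec<T>` carries pointer, capacity, and length, while a boxed slice drops the capacity field because its length can never change. `Vec::into_boxed_slice` performs the conversion and also frees any excess capacity.

   ```rust
   use std::mem::size_of;

   fn main() {
       // On a typical 64-bit target this prints 24 and 16.
       println!("Vec<u64>:   {} bytes", size_of::<Vec<u64>>());
       println!("Box<[u64]>: {} bytes", size_of::<Box<[u64]>>());
   }
   ```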





[GitHub] [arrow] jorgecarleitao commented on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks

jorgecarleitao commented on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-739682995


   Can you rebase this, so that I can merge it?





[GitHub] [arrow] jorgecarleitao commented on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks

jorgecarleitao commented on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-733815809


   Genuinely curious: does the key size have such a large impact? Or are there any memory constraints that you are looking at?





[GitHub] [arrow] Dandandan commented on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks

Dandandan commented on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-739702160


   I also did some tests against master; it looks like this PR has some small runtime savings on join queries too, I think mostly because `.clone()` is cheaper now. I will do some profiling later.
   It would be nice to have some criterion benchmarks covering joins as well, and I think it would also make sense to have some benchmarks / profiling that cover maximum memory usage.





[GitHub] [arrow] Dandandan commented on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks

Dandandan commented on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-739694900


   @jorgecarleitao thanks, updated against master!





[GitHub] [arrow] Dandandan commented on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks

Dandandan commented on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-740066372


   I also have a follow-up to this PR which converts the key for the values to `Vec<u8>` for the hash joins, which again is faster. I have also been reading up a bit on hash join algorithms; it looks like hash collisions are mostly resolved afterwards instead of in a hash map directly. I guess that could also be a bit easier, as we don't have to copy the data into the hash map itself but can keep it in the columns / rows themselves. It would be interesting to see whether we can hash a list of columns much more efficiently this way (e.g. hashing whole columns instead).
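
   A minimal sketch of that idea (all names are hypothetical): the map stores only row indices per hash, and candidate matches are verified against the original columns while probing, so the key bytes are never copied into the map itself.

   ```rust
   use std::collections::HashMap;

   // Build side: map each key hash to the rows that produced it. The key
   // data stays in the source columns; only indices are stored here.
   fn build_index(build_hashes: &[u64]) -> HashMap<u64, Vec<usize>> {
       let mut index: HashMap<u64, Vec<usize>> = HashMap::new();
       for (row, &hash) in build_hashes.iter().enumerate() {
           index.entry(hash).or_default().push(row);
       }
       index
   }

   // Probe side: rows sharing a hash are only candidates; collisions are
   // resolved afterwards by comparing the actual column values.
   fn matching_rows(
       index: &HashMap<u64, Vec<usize>>,
       probe_hash: u64,
       keys_equal: impl Fn(usize) -> bool, // compares a build row to the probe row
   ) -> Vec<usize> {
       index
           .get(&probe_hash)
           .map(|candidates| candidates.iter().copied().filter(|&r| keys_equal(r)).collect())
           .unwrap_or_default()
   }
   ```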

