Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/04 11:59:15 UTC

[GitHub] [arrow] Dandandan opened a new pull request #8832: ARROW-10807: [Rust][DataFusion] Avoid double hashing

Dandandan opened a new pull request #8832:
URL: https://github.com/apache/arrow/pull/8832


   This PR shows one area for improvement in the hash join. Currently the key (a `Vec`) is hashed twice: first when looking up the key, and then again when inserting or mutating the value.
   Using the unstable `hash_raw_entry` API we can avoid this and get some speedup (mostly in the hash join).
   
   We could also use the hashbrown crate instead to avoid needing a nightly compiler.
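
The double-hashing pattern described above can be reproduced with only the standard library. The sketch below is illustrative and is not the code from this PR: `CountedKey`, `insert_double_hash`, `insert_single_hash` and `count_hashes_on_miss` are hypothetical names, and the stable `entry` API is used as a stand-in for raw entry. Unlike raw entry, `entry` needs an owned key up front, which is one reason a raw entry API can be preferable for `Vec` keys.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

thread_local! {
    // Counts how often a key is hashed (illustration only).
    static HASH_CALLS: Cell<usize> = Cell::new(0);
}

// Hypothetical key type wrapping a Vec, instrumented to count hash calls.
#[derive(PartialEq, Eq, Clone)]
struct CountedKey(Vec<u8>);

impl Hash for CountedKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        HASH_CALLS.with(|c| c.set(c.get() + 1));
        self.0.hash(state);
    }
}

fn hash_calls() -> usize {
    HASH_CALLS.with(|c| c.get())
}

fn reset_hash_calls() {
    HASH_CALLS.with(|c| c.set(0));
}

// Look up first, then insert on a miss: the key is hashed by `get_mut`
// and again by `insert`.
fn insert_double_hash(map: &mut HashMap<CountedKey, Vec<usize>>, key: &CountedKey, row: usize) {
    match map.get_mut(key) {
        Some(rows) => rows.push(row),
        None => {
            map.insert(key.clone(), vec![row]);
        }
    }
}

// A single `entry` call covers both lookup and insert (stable stand-in
// for the raw entry API; it requires an owned key).
fn insert_single_hash(map: &mut HashMap<CountedKey, Vec<usize>>, key: CountedKey, row: usize) {
    map.entry(key).or_insert_with(Vec::new).push(row);
}

// Returns (hash calls for lookup-then-insert, hash calls for entry) when
// inserting a key that is not yet present.
fn count_hashes_on_miss() -> (usize, usize) {
    // Pre-sized and seeded maps, so neither a resize nor an empty-map
    // shortcut distorts the counts.
    let mut a: HashMap<CountedKey, Vec<usize>> = HashMap::with_capacity(16);
    let mut b: HashMap<CountedKey, Vec<usize>> = HashMap::with_capacity(16);
    a.insert(CountedKey(vec![0]), vec![0]);
    b.insert(CountedKey(vec![0]), vec![0]);
    let key = CountedKey(vec![1, 2, 3]);

    reset_hash_calls();
    insert_double_hash(&mut a, &key, 7);
    let double = hash_calls();

    reset_hash_calls();
    insert_single_hash(&mut b, key, 7);
    let single = hash_calls();

    (double, single)
}
```

Calling `count_hashes_on_miss()` should report more hash calls for the lookup-then-insert path than for the `entry` path; both paths leave the same contents in the map.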
   
   This brings the query 12 time down from > 1500 ms locally to:
   ```
   Query 12 iteration 0 took 1425 ms
   Query 12 iteration 1 took 1427 ms
   Query 12 iteration 2 took 1481 ms
   Query 12 iteration 3 took 1465 ms
   Query 12 iteration 4 took 1469 ms
   Query 12 iteration 5 took 1455 ms
   Query 12 iteration 6 took 1482 ms
   Query 12 iteration 7 took 1478 ms
   Query 12 iteration 8 took 1480 ms
   Query 12 iteration 9 took 1463 ms
   ```
   
   FYI @jorgecarleitao @andygrove 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8832: ARROW-10807: [Rust][DataFusion] Avoid double hashing

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8832:
URL: https://github.com/apache/arrow/pull/8832#issuecomment-739684671


   @alamb @andygrove , this introduces a new dependency to DataFusion. Is that ok for you?





[GitHub] [arrow] jorgecarleitao closed pull request #8832: ARROW-10807: [Rust][DataFusion] Avoid double hashing

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8832:
URL: https://github.com/apache/arrow/pull/8832


   





[GitHub] [arrow] Dandandan commented on pull request #8832: ARROW-10807: [Rust][DataFusion] Avoid double hashing

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8832:
URL: https://github.com/apache/arrow/pull/8832#issuecomment-738912832


   This is ready for review now @jorgecarleitao @andygrove 





[GitHub] [arrow] github-actions[bot] commented on pull request #8832: ARROW-10807: [Rust][DataFusion] Avoid double hashing

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8832:
URL: https://github.com/apache/arrow/pull/8832#issuecomment-738751412


   https://issues.apache.org/jira/browse/ARROW-10807





[GitHub] [arrow] Dandandan commented on pull request #8832: ARROW-10807: [Rust][DataFusion] Avoid double hashing

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8832:
URL: https://github.com/apache/arrow/pull/8832#issuecomment-740086807


   Some additional context: in the future, when the feature is stabilized, the hashbrown dependency can be dropped again. I think the raw entry API will be useful for future optimizations / hash join algorithms as well; for example, it also allows supplying your own keys instead of deriving them from a value.

