Posted to jira@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2021/04/26 13:26:02 UTC

[jira] [Commented] (ARROW-11112) [Rust][DataFusion] Implement vectorized hashing

    [ https://issues.apache.org/jira/browse/ARROW-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332337#comment-17332337 ] 

Andrew Lamb commented on ARROW-11112:
-------------------------------------

Migrated to github: https://github.com/apache/arrow-datafusion/issues/142

> [Rust][DataFusion] Implement vectorized hashing
> -----------------------------------------------
>
>                 Key: ARROW-11112
>                 URL: https://issues.apache.org/jira/browse/ARROW-11112
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Daniël Heres
>            Priority: Major
>
> Currently, the join and hash aggregate operators build a key for each row individually from that row's values. This is far from ideal: it doesn't take advantage of the cache-friendly, vectorized nature of Arrow, and instead copies data into a `Vec`, traverses multiple arrays in the inner loop, etc.
> This blog post has a summary of an approach to do this in a vectorized way.
> [https://www.cockroachlabs.com/blog/vectorized-hash-joiner/]
>  
> TBD:
> We should decide/find out whether it still makes sense to use Rust's `HashMap` (with () as key?) or to create our own. The benefits of using `HashMap` are that it has an existing API, resizes automatically, uses SIMD, and also exposes some lower-level bits we can use here.
>  
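
For readers of this archive, the column-at-a-time idea in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not DataFusion's implementation and not necessarily the exact scheme from the linked blog post: plain slices stand in for Arrow arrays, and the names `hash_column`, `combine`, and `group_rows_by_hash` are hypothetical. Each column is hashed in its own tight loop into a shared buffer of per-row hashes, and only afterwards are rows grouped by that hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Folds a new value hash into the hash accumulated so far for a row.
/// (Hypothetical mixing function, chosen only for illustration.)
fn combine(seed: u64, value: u64) -> u64 {
    (seed.rotate_left(5) ^ value).wrapping_mul(0x517c_c1b7_2722_0a95)
}

/// Hashes one whole column, updating the per-row hash buffer in place.
/// Calling this once per key column touches a single array per loop,
/// instead of visiting every column for each row.
fn hash_column<T: Hash>(column: &[T], hashes: &mut [u64]) {
    assert_eq!(column.len(), hashes.len());
    for (value, hash) in column.iter().zip(hashes.iter_mut()) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        *hash = combine(*hash, hasher.finish());
    }
}

/// Groups row indices by their precomputed hash.
/// Rows that share a hash are only candidates for the same key: a real
/// join or aggregate must still compare the column values to rule out
/// hash collisions.
fn group_rows_by_hash(hashes: &[u64]) -> HashMap<u64, Vec<usize>> {
    let mut groups: HashMap<u64, Vec<usize>> = HashMap::new();
    for (row, hash) in hashes.iter().enumerate() {
        groups.entry(*hash).or_default().push(row);
    }
    groups
}
```

To use the sketch, allocate one `vec![0u64; num_rows]` buffer, call `hash_column` once per key column, and pass the buffer to `group_rows_by_hash`. No per-row key is materialized; the map stores only hashes and row indices.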


