You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2021/02/15 20:37:00 UTC

[jira] [Created] (ARROW-11635) [Rust] [DataFusion] Improve performance for grouping/hashing on dictionary encoded data

Andrew Lamb created ARROW-11635:
-----------------------------------

             Summary: [Rust] [DataFusion] Improve performance for grouping/hashing on dictionary encoded data
                 Key: ARROW-11635
                 URL: https://issues.apache.org/jira/browse/ARROW-11635
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Andrew Lamb


I am recording this for posterity / potential for someone else to help if they want:

While adding support for GROUP BY hash, [~jorgecarleitao] had some great suggestions
https://github.com/apache/arrow/pull/9233#issuecomment-762174671

The initial GROUP BY implementation hashes the actual value of the dictionary (aka looks up the underlying value). For the common case such as when the dictionary contains strings, this will likely do much more work than is necessary. 

In the common case we should be able to hash the dictionary indexes directly, or  possibly skip hashing entirely and build an aggregate table directly from the indexes  -- this would work incredibly well for low cardinality string columns

What makes it tricky is that we would have to handle the case where the dictionary itself is not the same across all record batches (and thus indexes in one record batch may not correspond to the same value in another)

Some possibly implementation ideas are:
Implement a special case for a shared dictionary across all input record batches, and have code to switch back to the more general case (hash table) if the dictionary ever changes.

Alternately, we could hold a hash table (or equivalent) for each distinct dictionary we saw and merge them all at the end. 

The second approach likely would likely be the fastest, but also would potentially consume the most resources



--
This message was sent by Atlassian Jira
(v8.3.4#803005)