You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@asterixdb.apache.org by "Chen Luo (Jira)" <ji...@apache.org> on 2020/09/26 16:39:00 UTC

[jira] [Created] (ASTERIXDB-2783) Serious Hash Collision in Hash Join/Groupby

Chen Luo created ASTERIXDB-2783:
-----------------------------------

             Summary: Serious Hash Collision in Hash Join/Groupby
                 Key: ASTERIXDB-2783
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2783
             Project: Apache AsterixDB
          Issue Type: Bug
          Components: RT - Runtime
            Reporter: Chen Luo
            Assignee: Chen Luo


The current implementation of Hash Join/Groupby suffers from a serious hash collision problem. In these two operators, we first used the hash exchange operator to assign each key to an NC partition (hash1(key)%P, where P is the number of partitions), and then build a hash table at each NC partition (hash2(key)%N, where N is the hash table size). However, our implementation currently uses the same hash function for both steps (hash1 == hash2). This is simply incorrect and can lead to a lot of hash collisions.

To see this problem, consider what happens to NC partition 0. After hash partitioning, for each key assigned to this partition, we know that hash(key)%P == 0. Unless the greatest common divisor of P and N is 0, there will be a lot hash collisions! For example, suppose P = 16 and N is a multiple of 4. Since hash(key) is a multiple of 16, we know for sure that hash(key)%N can be multiples of 4 as well! This implies that all slots that are multiples of 1,2,3 will always be empty, but all entries will be inserted into slots that are multiples of 4.

 

To fix this problem, we can simply use a different hash function for hash join/groupby.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)