You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@asterixdb.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/09/28 18:46:00 UTC

[jira] [Commented] (ASTERIXDB-2783) Serious Hash Collision in Hash Join/Groupby

    [ https://issues.apache.org/jira/browse/ASTERIXDB-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203457#comment-17203457 ] 

ASF subversion and git services commented on ASTERIXDB-2783:
------------------------------------------------------------

Commit 3f75e5b627493eae80c89f8eaa0f075fb31ab3af in asterixdb's branch refs/heads/master from luochen
[ https://gitbox.apache.org/repos/asf?p=asterixdb.git;h=3f75e5b ]

[ASTERIXDB-2783] Fix hash collision for hash join/groupby

- user model changes: no
- storage format changes: no
- interface changes: no

Details:
- Use a random seed for hash join/groupby to avoid hash collisions
with the hash partitioning
- Slightly increase the join memory so that the large object join
test case can still pass.

Change-Id: If2aa02384129293e80015efc3d1f60b57f98909c
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/8123
Integration-Tests: Jenkins <je...@fulliautomatix.ics.uci.edu>
Tested-by: Jenkins <je...@fulliautomatix.ics.uci.edu>
Reviewed-by: Dmitry Lychagin <dm...@couchbase.com>


> Serious Hash Collision in Hash Join/Groupby
> -------------------------------------------
>
>                 Key: ASTERIXDB-2783
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2783
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: RT - Runtime
>            Reporter: Chen Luo
>            Assignee: Chen Luo
>            Priority: Critical
>
> The current implementation of Hash Join/Groupby suffers from a serious hash collision problem. In these two operators, we first used the hash exchange operator to assign each key to an NC partition (hash1(key)%P, where P is the number of partitions), and then build a hash table at each NC partition (hash2(key)%N, where N is the hash table size). However, our implementation currently uses the same hash function for both steps (hash1 == hash2). This is simply incorrect and can lead to a lot of hash collisions.
> To see this problem, consider what happens to NC partition 0. After hash partitioning, for each key assigned to this partition, we know that hash(key)%P == 0. Unless the greatest common divisor of P and N is 0, there will be a lot hash collisions! For example, suppose P = 16 and N is a multiple of 4. Since hash(key) is a multiple of 16, we know for sure that hash(key)%N can be multiples of 4 as well! This implies that all slots that are multiples of 1,2,3 will always be empty, but all entries will be inserted into slots that are multiples of 4.
>  
> To fix this problem, we can simply use a different hash function for hash join/groupby.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)