Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/11/21 20:17:10 UTC
[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15020627#comment-15020627 ]
Aman Sinha commented on DRILL-4119:
-----------------------------------
The problem comes from the cast to integer after computing the hash value: castInt(hash64AsDouble($0)). I verified that hash64AsDouble produces a good distribution for the hash value, but the cast loses precision. The hash-based operators all use a 32-bit hash value (for a smaller memory footprint and related reasons), so we do need the integer value, but we should preserve as much of the underlying distribution as possible.
I am fixing this by having the underlying hash function itself compute a 32-bit hash value instead of casting to int: it first computes the 64-bit hash and then XORs the most significant 4 bytes with the least significant 4 bytes. The hash32 functions in XXHash.java (for example, see https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/XXHash.java#L198) currently call hash64 and then cast to int. I propose changing these to use the above mechanism of combining the most significant and least significant bytes. The CPU cost should be relatively small.
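For illustration, the folding step described above could look roughly like the sketch below. This is not the actual Drill patch: the class name HashFold and the methods fold64To32 and truncate are hypothetical names, and the 64-bit input stands in for whatever hash64 returns.

```java
// Hypothetical helper, for illustration only; not part of Drill's XXHash.java.
public final class HashFold {
    private HashFold() {}

    /** Folds a 64-bit hash into 32 bits by XORing the upper 4 bytes with the lower 4 bytes. */
    public static int fold64To32(long hash64) {
        // Unsigned shift brings the upper 32 bits down; the int cast keeps the low 32 bits of the XOR.
        return (int) (hash64 ^ (hash64 >>> 32));
    }

    /** Plain narrowing cast from long to int, shown for comparison: it discards the upper 32 bits. */
    public static int truncate(long hash64) {
        return (int) hash64;
    }

    public static void main(String[] args) {
        long h = 0x123456789ABCDEF0L; // arbitrary sample value, not a real hash
        System.out.println(Integer.toHexString(fold64To32(h))); // prints 88888888
        System.out.println(Integer.toHexString(truncate(h)));   // prints 9abcdef0
    }
}
```

With folding, every bit of the 64-bit hash contributes to the 32-bit result, at the cost of one shift and one XOR per value.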
> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.3.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
>
> We are seeing substantial skew for an Id column that contains varchar data of length 32. It is easily reproducible by a group-by query:
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02 HashAgg(group=[{0}])
> 01-03 Project(SomeId=[$0])
> 01-04 HashToRandomExchange(dist0=[[$0]])
> 02-01 UnorderedMuxExchange
> 03-01 Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02 HashAgg(group=[{0}])
> 03-03 Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type:
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}