Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/11/24 18:33:11 UTC
[jira] [Comment Edited] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

    [ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024913#comment-15024913 ] 

Aman Sinha edited comment on DRILL-4119 at 11/24/15 5:32 PM:
-------------------------------------------------------------

Our hash64 implementation looks similar to the original one, but I haven't done enough analysis to say they are exactly the same. The only way to check is through testing. Here are two values and their corresponding hashes from the original (note: for some reason the command-line utility xxh64sum does not read multiple lines from a file, so I had to split the values into separate files):
{noformat}
$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c

$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d

$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv

$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These values don't match the values I am getting from Drill after converting the long to hex (I used the Long.toHexString() method in the debugger), so it is possible something got lost in translation.
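
To rule out the hex conversion itself as the source of the mismatch, a small helper along these lines could be used. This is only a sketch: the class and method names (HashHexCompare, toHex16, fromHex) are hypothetical and not part of Drill. It relies on two facts about the JDK: Long.toHexString() drops leading zeros, and a hash with the top bit set is a negative Java long.

```java
public class HashHexCompare {
    // Hypothetical helper: format a 64-bit hash as 16 lowercase hex digits,
    // zero-padded, which is the shape of the xxh64sum digests shown above.
    static String toHex16(long hash) {
        return String.format("%016x", hash);
    }

    // Hypothetical helper: parse a digest printed by xxh64sum into a Java long.
    // Digests with the top bit set come back as negative values.
    static long fromHex(String hex) {
        return Long.parseUnsignedLong(hex, 16);
    }

    public static void main(String[] args) {
        // Reference digest for sample2.csv, copied from the xxh64sum output above.
        String reference = "e0658433041ce9aa";
        long asLong = fromHex(reference);

        // The same bits seen as a signed long, as a debugger would display them.
        System.out.println("signed value: " + asLong);

        // Round trip back to the padded hex form for a like-for-like comparison.
        System.out.println("round trip matches: " + toHex16(asLong).equals(reference));

        // Long.toHexString() omits leading zeros, so a digest that starts
        // with 0 would print with fewer than 16 digits.
        System.out.println(Long.toHexString(0x0213a50f060e0659L));
    }
}
```

Comparing toHex16(drillHash) against the xxh64sum digest only removes formatting differences. It does not address other possible causes, such as a different seed, or a trailing newline in the sample files (xxh64sum hashes the file contents as-is); those would need to be checked separately.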


was (Author: amansinha100):
Our hash64 implementation looks similar to the original one but I haven't done enough analysis to say they are exactly the same.  The only way to check is through testing.  Here are 2 values and their corresponding hash from the original (note, for some reason the command line utility xxh64sum does not read multiple lines from a file, so I had to break up the values into separate files): 
{noformat}
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat > sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These values don't match the value I am getting from Drill  after doing the conversion of the long to hex (I used Long.toHexString() method in debugger to convert), so it is possible something may have gotten lost in translation. 

> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
>                 Key: DRILL-4119
>                 URL: https://issues.apache.org/jira/browse/DRILL-4119
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of length 32. It is easily reproducible with a group-by query:
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02          HashAgg(group=[{0}])
> 01-03            Project(SomeId=[$0])
> 01-04              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02                    HashAgg(group=[{0}])
> 03-03                      Project(SomeId=[$0])
> {noformat}
> The string ids have the following form:
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}


