Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/10/02 17:32:01 UTC
[jira] [Commented] (DRILL-5816) Hash function produces skewed results on String values with same leading prefix
[ https://issues.apache.org/jira/browse/DRILL-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16188472#comment-16188472 ]
ASF GitHub Bot commented on DRILL-5816:
---------------------------------------
Github user asfgit closed the pull request at:
https://github.com/apache/drill/pull/959
> Hash function produces skewed results on String values with same leading prefix
> -------------------------------------------------------------------------------
>
> Key: DRILL-5816
> URL: https://issues.apache.org/jira/browse/DRILL-5816
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Sorabh Hamirwasia
> Assignee: Sorabh Hamirwasia
> Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Reported by [~amansinha100]
> Hashing of string values (for the hash exchange) could produce substantial skew for certain types of strings that have the same leading prefix.
> Here's the sample data: (note all strings begin with 'mscId=' followed by numeric values)
> 0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
> +---------------------+
> |          a          |
> +---------------------+
> | mscId=100139170495  |
> | mscId=100103806655  |
> | mscId=100229137840  |
> | mscId=100362859440  |
> | mscId=100032583600  |
> | mscId=100125021360  |
> | mscId=100243775920  |
> | mscId=100152820405  |
> | mscId=100084724405  |
> | mscId=100297398970  |
> | mscId=100059560890  |
> | mscId=100106108090  |
> | mscId=100032092090  |
> | mscId=100029460410  |
> | mscId=100110390995  |
> | mscId=100019105235  |
> | mscId=100354644435  |
> | mscId=100288523475  |
> | mscId=100214507475  |
> | mscId=100296418515  |
> +---------------------+
> 20 rows selected (0.33 seconds)
> Here are the hash values produced by the hash function that Drill uses for the HashToRandomExchange (note that they are all even numbers):
> 0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from dfs.tmp.vv3 limit 20;
> +--------------+
> |    EXPR$0    |
> +--------------+
> | 1180062632   |
> | -1322734784  |
> | 2096701320   |
> | 2075007536   |
> | -1970336592  |
> | 1614574192   |
> | 1592743936   |
> | -1053691072  |
> | -689805200   |
> | 1893061072   |
> | 1660328376   |
> | 1852126136   |
> | 1927731344   |
> | 616840056    |
> | -1997249184  |
> | 1588717872   |
> | 193019624    |
> | 880839008    |
> | 1879415496   |
> | 1726850216   |
> +--------------+
> 20 rows selected (0.311 seconds)
> Taking these hash values mod 56 produces only one distinct value, which indicates the skew:
> 0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a, 1301011), 56) from dfs.tmp.vv3 limit 20;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> 1 row selected (1.041 seconds)
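The skew described above can be reproduced offline from the 20 hash values listed in the report. The following is a minimal sketch in Python, not Drill code. It assumes a row is routed to a receiver by taking its hash value modulo the number of receivers, with 56 taken from the query above. The names HASHES and bucket_counts are hypothetical and introduced only for illustration.

```python
from collections import Counter

# The 20 hash values copied from the hash32AsDouble output in the report.
HASHES = [
    1180062632, -1322734784, 2096701320, 2075007536, -1970336592,
    1614574192, 1592743936, -1053691072, -689805200, 1893061072,
    1660328376, 1852126136, 1927731344, 616840056, -1997249184,
    1588717872, 193019624, 880839008, 1879415496, 1726850216,
]


def bucket_counts(hashes, num_receivers):
    """Count rows per bucket, ASSUMING bucket = hash mod num_receivers.

    Python's % always returns a non-negative result for a positive modulus,
    which may differ from SQL mod() on negative inputs, but a remainder of
    zero is zero under either convention.
    """
    return Counter(h % num_receivers for h in hashes)


if __name__ == "__main__":
    counts = bucket_counts(HASHES, 56)
    # Number of distinct buckets actually used out of 56 possible ones.
    print("distinct buckets:", len(counts))
    print("rows per bucket:", dict(counts))
    # The report also notes that every hash value is even.
    print("all even:", all(h % 2 == 0 for h in HASHES))
```

With these values every row falls into bucket 0, which matches the single distinct value returned by the mod 56 query: one receiver would get all the rows and the other 55 would get none.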
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)