Posted to dev@drill.apache.org by "Sorabh Hamirwasia (JIRA)" <ji...@apache.org> on 2017/09/25 02:30:00 UTC
[jira] [Created] (DRILL-5816) Hash function produces skewed results on String values with same leading prefix
Sorabh Hamirwasia created DRILL-5816:
----------------------------------------
Summary: Hash function produces skewed results on String values with same leading prefix
Key: DRILL-5816
URL: https://issues.apache.org/jira/browse/DRILL-5816
Project: Apache Drill
Issue Type: Bug
Reporter: Sorabh Hamirwasia
Assignee: Sorabh Hamirwasia
Fix For: 1.12.0
Reported by [~amansinha100]
Hashing of string values (for the hash exchange) could produce substantial skew for certain types of strings that have the same leading prefix.
Here is the sample data (note that all strings begin with 'mscId=' followed by numeric values):
0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
+---------------------+
| a |
+---------------------+
| mscId=100139170495 |
| mscId=100103806655 |
| mscId=100229137840 |
| mscId=100362859440 |
| mscId=100032583600 |
| mscId=100125021360 |
| mscId=100243775920 |
| mscId=100152820405 |
| mscId=100084724405 |
| mscId=100297398970 |
| mscId=100059560890 |
| mscId=100106108090 |
| mscId=100032092090 |
| mscId=100029460410 |
| mscId=100110390995 |
| mscId=100019105235 |
| mscId=100354644435 |
| mscId=100288523475 |
| mscId=100214507475 |
| mscId=100296418515 |
+---------------------+
20 rows selected (0.33 seconds)
Here are the hash values produced by the hash function that Drill uses for the HashToRandomExchange (note that they are all even numbers):
0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from dfs.tmp.vv3 limit 20;
+--------------+
| EXPR$0 |
+--------------+
| 1180062632 |
| -1322734784 |
| 2096701320 |
| 2075007536 |
| -1970336592 |
| 1614574192 |
| 1592743936 |
| -1053691072 |
| -689805200 |
| 1893061072 |
| 1660328376 |
| 1852126136 |
| 1927731344 |
| 616840056 |
| -1997249184 |
| 1588717872 |
| 193019624 |
| 880839008 |
| 1879415496 |
| 1726850216 |
+--------------+
20 rows selected (0.311 seconds)
Taking these hash values mod 56 produces only one distinct value, which demonstrates the skew:
0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a, 1301011), 56) from dfs.tmp.vv3 limit 20;
+---------+
| EXPR$0 |
+---------+
| 0 |
+---------+
1 row selected (1.041 seconds)
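The same check can be reproduced outside Drill on the 20 hash values listed above. The following is only a rough sketch in Python, not Drill code: the helper names bucket_counts and common_trailing_zero_bits are made up for illustration, and the bucket count of 56 is simply the modulus used in the query above.

```python
from collections import Counter

# Hash values copied verbatim from the hash32AsDouble output above.
HASHES = [
    1180062632, -1322734784, 2096701320, 2075007536, -1970336592,
    1614574192, 1592743936, -1053691072, -689805200, 1893061072,
    1660328376, 1852126136, 1927731344, 616840056, -1997249184,
    1588717872, 193019624, 880839008, 1879415496, 1726850216,
]


def bucket_counts(hashes, num_buckets):
    """Count how many hash values land in each bucket (hash mod num_buckets).

    Hypothetical helper. Python's % always returns a non-negative result,
    which can differ from SQL mod for negative inputs, but a remainder of 0
    is the same under both conventions.
    """
    return Counter(h % num_buckets for h in hashes)


def common_trailing_zero_bits(hashes):
    """Return how many low-order bits are zero in every hash value.

    Hypothetical helper. Assumes at least one value is non-zero.
    """
    combined = 0
    for h in hashes:
        combined |= h  # a bit survives if it is set in at least one value
    lowest_set_bit = combined & -combined
    return lowest_set_bit.bit_length() - 1


print(bucket_counts(HASHES, 56))          # Counter({0: 20})
print(common_trailing_zero_bits(HASHES))  # 3
```

For this sample, every value shares three zero low-order bits (so each is a multiple of 8, not just even), and all 20 values fall into bucket 0 when reduced mod 56, matching the single distinct value returned by the query.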