Posted to common-dev@hadoop.apache.org by "Steve Scaffidi (JIRA)" <ji...@apache.org> on 2015/07/10 19:47:04 UTC

[jira] [Created] (HADOOP-12217) hashCode in DoubleWritable returns same value for many numbers

Steve Scaffidi created HADOOP-12217:
---------------------------------------

             Summary: hashCode in DoubleWritable returns same value for many numbers
                 Key: HADOOP-12217
                 URL: https://issues.apache.org/jira/browse/HADOOP-12217
             Project: Hadoop Common
          Issue Type: Bug
          Components: io
    Affects Versions: 2.7.1, 2.7.0, 2.6.0, 2.5.2, 2.5.1, 2.4.1, 2.5.0, 2.4.0, 2.3.0, 2.2.0, 2.0.6-alpha, 2.1.1-beta, 0.23.11, 0.23.10, 0.23.9, 2.0.5-alpha, 1.2.1, 0.23.8, 2.0.4-alpha, 2.1.0-beta, 0.23.7, 1.1.2, 0.23.6, 0.23.5, 2.0.3-alpha, 0.23.4, 2.0.2-alpha, 2.0.1-alpha, 2.0.0-alpha, 0.23.3, 0.23.1, 0.23.0, 0.22.0, 0.21.0, 1.2.0, 1.1.1, 1.1.0, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0, 0.20.205.0, 0.20.204.0, 0.20.203.0, 0.20.2, 0.20.1, 0.20.0, 0.19.1, 0.19.0, 0.18.3, 0.18.2, 0.18.1, 0.18.0
            Reporter: Steve Scaffidi


Because DoubleWritable.hashCode() returns the same value for many distinct numbers, using DoubleWritables as keys in a HashMap results in abysmal performance due to hash code collisions.
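
To illustrate the kind of collision involved, here is a minimal, standalone sketch. It does not use Hadoop at all. The truncatedHash() method is an assumption about the shape of the faulty implementation (keeping only the low 32 bits of the double's bit pattern); check it against the DoubleWritable source for the version you run. The class and method names are hypothetical. mixedHash() uses the formula documented for java.lang.Double.hashCode().

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.DoubleToIntFunction;

class HashCollisionDemo {
    // ASSUMPTION: stands in for the faulty hash; keeps only the low 32 bits
    // of the IEEE 754 bit pattern and discards sign, exponent and the top
    // of the mantissa.
    static int truncatedHash(double value) {
        return (int) Double.doubleToLongBits(value);
    }

    // Same formula as java.lang.Double.hashCode(): XOR the high and low halves.
    static int mixedHash(double value) {
        long bits = Double.doubleToLongBits(value);
        return (int) (bits ^ (bits >>> 32));
    }

    // Counts how many distinct hash codes the doubles 0.0 .. n-1 produce.
    static int countDistinct(DoubleToIntFunction hash, int n) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(hash.applyAsInt((double) i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("truncated: " + countDistinct(HashCollisionDemo::truncatedHash, n));
        System.out.println("mixed:     " + countDistinct(HashCollisionDemo::mixedHash, n));
    }
}
```

Under that assumption, every small integer-valued double has all-zero low mantissa bits, so all 1000 keys would land in the same HashMap bucket, while the mixed hash gives each key its own code.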

I discovered this while testing the latest version of Hive, where certain mapjoin queries were exceedingly slow.

Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable that used to override hashCode() with a correct implementation, but for some reason they recently removed that code, so it now uses the incorrect hashCode() method inherited from Hadoop's DoubleWritable.

It appears that this bug has been there since DoubleWritable was created(!), so I can understand if fixing it is impractical due to the possibility of breaking things downstream, but off the top of my head I can't think of anything that *should* break.

Searching JIRA, I found several related tickets, which may be useful for some historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629, HIVE-7041


