You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Claude Warren (Jira)" <ji...@apache.org> on 2019/11/24 09:09:00 UTC

[jira] [Commented] (CODEC-268) Deprecate methods in MurmurHash3 with no equivalent in the reference c++ source.

    [ https://issues.apache.org/jira/browse/CODEC-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981011#comment-16981011 ] 

Claude Warren commented on CODEC-268:
-------------------------------------

I agree with most of this resolution steps in this issue with the following exceptions:

The older implementations should not be removed.  There are cases where the use of the "broken" implementation can not be easily changed.  For example if the MurmurHash is ues to build a hash of a file and that hash is then used to calculate a storage location.  An application using the hash in this way can not simply change the hash as all previously stored files will be unreachable.  Therefore access to the old implementations should remain for some period of time.

I think that the older methods should be deprecated and developers pointed to he new implementations by means of javadoc.

But that the older implementations must remain to support earlier release that are in the wild. 


> Deprecate methods in MurmurHash3 with no equivalent in the reference c++ source.
> --------------------------------------------------------------------------------
>
>                 Key: CODEC-268
>                 URL: https://issues.apache.org/jira/browse/CODEC-268
>             Project: Commons Codec
>          Issue Type: Wish
>    Affects Versions: 1.13
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Minor
>
> The following methods have no equivalent in the [MurmurHash3 c++ source code|https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp]
> {code:java}
>     public static long hash64(final long data) {
>     public static long hash64(final int data) {
>     public static long hash64(final short data) {
>     public static long hash64(final byte[] data) {
>     public static long hash64(final byte[] data, final int offset, final int length) {
>     public static long hash64(final byte[] data, final int offset, final int length, final int seed) {
> {code}
> They are documented to return the upper 64-bits of the 128-bit hash method. This is false as the code neglects to mix in alternating blocks of 64-bits with the corresponding other hash that would be built in the 128-bit method. Thus this passes using a NotEquals check:
> {code:java}
> @Test
> public void testHash64InNotEqualToHash128() {
>     for (int i = 0; i < 32; i++) {
>         final byte[] bytes = Arrays.copyOf(RANDOM_BYTES, i);
>         final long h1 = MurmurHash3.hash64(bytes);
>         final long[] hash = MurmurHash3.hash128(bytes);
>         Assert.assertNotEquals("Did not expect hash64 to match upper bits of hash128", hash[0], h1);
>         Assert.assertNotEquals("Did not expect hash64 to match lower bits of hash128", hash[1], h1);
>     }
> }
> {code}
> The following methods are convenience methods that use an arbitrary default seed:
> {code:java}
>     public static final int DEFAULT_SEED = 104729;
>     public static int hash32(final long data1, final long data2) {
>     public static int hash32(final long data1, final long data2, final int seed) {
>     public static int hash32(final long data) {
>     public static int hash32(final long data, final int seed) {
> {code}
> Although they match calling hash32() with the equivalent bytes this assumes big-endian format. The reference c++ code is designed for little endian. If you hash the corresponding values using a Google Guava implementation with the same seed then they do not match as Guava faithfully converts primitives as little endian. Also note that the available methods for hashing primitives are not the same as those in hash64.
> These methods thus make very little sense and should be removed as an unwanted legacy from the port from Apache Hive.
> The following methods are convenience methods that use default encoding via String.getBytes():
> {code:java}
>     public static int hash32(final String data) {
>     public static long[] hash128(final String data) {
> {code}
> These effectively do nothing and save 0 lines of code for the caller:
> {code:java}
>     String s;
>     int hash1 = MurmurHash3.hash32(s);
>     int hash2 = MurmurHash3.hash32(s.getBytes());
>     assert hash1 == hash2;
> {code}
> This is not even used:
> {code:java}
>     public static final long NULL_HASHCODE = 2862933555777941757L;
> {code}
> The following methods for hash32, hash64 and hash128 are not consistent with the available arguments:
> {code:java}
>     public static int hash32(final byte[] data) {
>     public static int hash32(final String data) {
>     public static int hash32(final byte[] data, final int length) {
>     public static int hash32(final byte[] data, final int length, final int seed) {
>     public static int hash32(final byte[] data, final int offset, final int length, final int seed) {
>     public static long hash64(final byte[] data) {
>     // Two int arguments specify (offset, length) not (length, seed)
>     public static long hash64(final byte[] data, final int offset, final int length) {
>     public static long hash64(final byte[] data, final int offset, final int length, final int seed) {
>     public static long[] hash128(final byte[] data) {
>     public static long[] hash128(final String data) {
>     // No other (offset, length, seed) combinations
>     public static long[] hash128(final byte[] data, final int offset, final int length, final int seed) {
> {code}
> I would suggest deprecating:
>  * All hash64 methods
>  * The default seed
>  * The null hashcode
>  * The helper methods that hash primitives
>  * The helper methods that hash String using default encoding
>  * Those methods that accept various combinations of offset, length and seed
> This would leave a cleaner API:
> {code:java}
>     public static int hash32(final byte[] data) {
>     public static int hash32(final byte[] data, final int offset, final int length, final int seed) {
>     public static long[] hash128(final byte[] data) {
>     public static long[] hash128(final byte[] data, final int offset, final int length, final int seed) {
> {code}
> There are outstanding bugs in CODEC-264 and CODEC-267 for sign extension errors in hash32 and hash128. These should be fixed using a minimal API with names that reference the x86 and x64 names from the MurmurHash3 reference c++ code:
> I would argue that the default seed should be zero (it is zero in Google Guava and Python mmh3 implementations of MurmurHash3), not an arbitrary random int.
> {code:java}
>     public static int hash32x86(final byte[] data) {
>     public static int hash32x86(final byte[] data, final int offset, final int length, final int seed) {
>     public static long[] hash128x64(final byte[] data) {
>     public static long[] hash128x64(final byte[] data, final int offset, final int length, final int seed) {
> {code}
> All deprecated methods should be referred to use these instead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)