Posted to dev@mahout.apache.org by "Grant Ingersoll (Created) (JIRA)" <ji...@apache.org> on 2011/11/02 14:05:32 UTC

[jira] [Created] (MAHOUT-862) MurmurHash 3.0

MurmurHash 3.0
--------------

                 Key: MAHOUT-862
                 URL: https://issues.apache.org/jira/browse/MAHOUT-862
             Project: Mahout
          Issue Type: Improvement
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll
            Priority: Minor


Yonik has ported an implementation of MurmurHash 3.0 and put it in the public domain: http://www.lucidimagination.com/blog/2011/09/15/murmurhash3-for-java/

It's a port of https://sites.google.com/site/murmurhash/ which says: 
{quote}
(I reserve the right to tweak the constants after people have had a chance to bang on it). Murmur3 has better performance than MurmurHash2, no repetition flaw, comes in 32/64/128-bit versions for both x86 and x64 platforms, and the 128-bit x64 version is blazing fast - over 5 gigabytes per second on my 3 gigahertz Core 2.

In addition, the library of test code that I use to test MurmurHash (called SMHasher) has been released - it's still rough (and will only compile under VC++ at the moment), but it contains everything needed to verify hash functions of arbitrary output bit-lengths.

Murmur3 and all future versions will be hosted on Google Code here - http://code.google.com/p/smhasher/ - you can access the codebase via the 'Source' tab at the top.
{quote}

See also http://code.google.com/p/smhasher/

We should add support for it and hook into MinHash
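
For reference, here is a minimal sketch of the 32-bit x86 variant of MurmurHash3 in Java, written from the reference MurmurHash3.cpp in SMHasher. It is not Yonik's port and not the code in the attached patch; the names Murmur3Sketch and hash32 are made up for illustration, and it hashes a whole byte array rather than an offset/length slice.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch of MurmurHash3 x86_32; names are hypothetical. */
final class Murmur3Sketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    static int hash32(byte[] data, int seed) {
        int h1 = seed;
        int len = data.length;
        int roundedEnd = len & ~3; // length rounded down to a multiple of 4

        // Body: mix one little-endian 32-bit block at a time.
        for (int i = 0; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | ((data[i + 3] & 0xff) << 24);
            k1 *= C1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= C2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: fold in the remaining 1 to 3 bytes, if any.
        int k1 = 0;
        switch (len & 3) {
            case 3:
                k1 ^= (data[roundedEnd + 2] & 0xff) << 16; // fall through
            case 2:
                k1 ^= (data[roundedEnd + 1] & 0xff) << 8; // fall through
            case 1:
                k1 ^= data[roundedEnd] & 0xff;
                k1 *= C1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= C2;
                h1 ^= k1;
                break;
            default:
                break;
        }

        // Finalization: fold in the length, then avalanche the bits.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    public static void main(String[] args) {
        byte[] key = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%08x%n", hash32(key, 0));
    }
}
```

A MinHash hook would presumably call something like hash32 once per feature with a different seed per hash function, but that wiring is an assumption and is not shown here.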

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by Ted Dunning <te...@gmail.com>.
Great.

Btw, I changed the code quite a bit in my previous ports to use
ByteBuffers.  That made it much more portable and considerably faster than
the naive port I started with.

I would suggest similar efforts be made for this code.

On Wed, Nov 2, 2011 at 8:01 AM, Grant Ingersoll (Commented) (JIRA) <
jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142200#comment-13142200]
>
> Grant Ingersoll commented on MAHOUT-862:
> ----------------------------------------
>
> Committed revision 1196616.
>
> I'll leave open for a day or two so others can review.
>

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Yonik Seeley (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142261#comment-13142261 ] 

Yonik Seeley commented on MAHOUT-862:
-------------------------------------

The test case I used is here:
https://github.com/yonik/java_util/tree/master/test/util/hash

bq. Also, this could be rewritten to actually make use of Unsafe if it's available -- I bet with larger memory chunks the speed gain would be noticeable.

For larger memory chunks I'd go with the 64 bit variant of MurmurHash.
                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142281#comment-13142281 ] 

Dawid Weiss commented on MAHOUT-862:
------------------------------------

Yes, sure, 64-bit is fine, but still: I was wondering what kind of machine code is generated for accessing array indexes -- if there is bounds checking, it will contribute to the overall time (compared to the C version, for example, or to unsafe memory access).

I'm not advocating making it more complex by using Unsafe; I was just curious about the difference and sort of thinking aloud.
                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142278#comment-13142278 ] 

Ted Dunning commented on MAHOUT-862:
------------------------------------

{quote}
The test case I used is here:
https://github.com/yonik/java_util/tree/master/test/util/hash
{quote}

Excellent.  We should use that test.  My grump was that *Mahout* didn't have the test, not that I knew the test didn't exist (I knew little, really).
                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142324#comment-13142324 ] 

Grant Ingersoll commented on MAHOUT-862:
----------------------------------------

I committed the test.
                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145276#comment-13145276 ] 

Dawid Weiss commented on MAHOUT-862:
------------------------------------

I checked the OpenJDK source code out of curiosity. Mapped buffers (for example, a LongBuffer over a ByteBuffer) are implemented with a mix of various intrinsics. For example, in:
{code}
    public LongBuffer put(int i, long x) {
        Bits.putLongL(bb, ix(checkIndex(i)), x);
        return this;
    }
{code}

checkIndex is an intrinsic and putLongL uses internal unchecked _put methods, so the sequence:
{code}
    static void putLongL(ByteBuffer bb, int bi, long x) {
        bb._put(bi + 7, long7(x));
        bb._put(bi + 6, long6(x));
        bb._put(bi + 5, long5(x));
        bb._put(bi + 4, long4(x));
        bb._put(bi + 3, long3(x));
        bb._put(bi + 2, long2(x));
        bb._put(bi + 1, long1(x));
        bb._put(bi    , long0(x));
    } 
{code}
is pretty much index-checked once. Bits contains lots of other accesses to Unsafe and I bet this is most of the speedup over normal Java code.

I didn't have the time to inspect assembly dumps to verify for sure, but the above should pretty much address the questions raised earlier in this thread. [Applies to Sun/Oracle HotSpot only; I didn't check other VMs.]
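
To make the public-API side of this concrete, here is a small sketch of a LongBuffer view over a ByteBuffer. It uses only the public java.nio calls (asLongBuffer, the absolute put) and says nothing about what the JIT emits; the class and method names are invented for the example. It shows that one put through the view writes eight bytes into the shared storage, and that an out-of-range index is still rejected.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

/** Illustrative only: a LongBuffer view sharing storage with a ByteBuffer. */
final class LongViewSketch {
    /** Writes one long at long-index 1 through the view; returns the backing bytes. */
    static byte[] writeThroughView(long value) {
        // The view inherits the byte order set before asLongBuffer() is called.
        ByteBuffer bytes = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        LongBuffer longs = bytes.asLongBuffer();
        longs.put(1, value); // long index 1 corresponds to byte offset 8
        return bytes.array();
    }

    /** Returns true if an out-of-range absolute put is rejected by the view. */
    static boolean outOfRangePutIsRejected() {
        LongBuffer longs = ByteBuffer.allocate(16).asLongBuffer(); // capacity: 2 longs
        try {
            longs.put(2, 0L);
            return false;
        } catch (IndexOutOfBoundsException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] raw = writeThroughView(0x0807060504030201L);
        System.out.println(raw[8] + " .. " + raw[15]);
        System.out.println(outOfRangePutIsRejected());
    }
}
```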
                
> MurmurHash 3.0
> --------------
>
>                 Key: MAHOUT-862
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-862
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: MAHOUT-862.patch

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142275#comment-13142275 ] 

Ted Dunning commented on MAHOUT-862:
------------------------------------

{quote}
Are speed gains on ByteBuffers a result of unsafe underlying buffer accesses? 
{quote}
No.  The speed gains are largely because getLong works really well on ByteBuffers (certainly better than byte-by-byte shift-and-mask code).  That then allows the JVM to do better loop optimizations (I think).
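
A rough sketch of the two styles being compared, for anyone following along. Both methods read the same little-endian 64-bit words from a byte array, one by assembling each word with shifts and masks and one via ByteBuffer.getLong. The XOR accumulator is only a stand-in for the hash's real mixing step, the names are hypothetical, and no timing claim is made here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative comparison of two ways to read 64-bit blocks from a byte[]. */
final class BlockReadSketch {
    /** Assembles one little-endian long byte by byte with shifts and masks. */
    static long readLongShiftMask(byte[] data, int offset) {
        long value = 0;
        for (int i = 7; i >= 0; i--) {
            value = (value << 8) | (data[offset + i] & 0xffL);
        }
        return value;
    }

    /** XORs every whole 8-byte word, reading each with shift-and-mask code. */
    static long xorWordsShiftMask(byte[] data) {
        long acc = 0;
        for (int off = 0; off + 8 <= data.length; off += 8) {
            acc ^= readLongShiftMask(data, off);
        }
        return acc;
    }

    /** XORs every whole 8-byte word, reading each with ByteBuffer.getLong. */
    static long xorWordsByteBuffer(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long acc = 0;
        for (int off = 0; off + 8 <= data.length; off += 8) {
            acc ^= buffer.getLong(off); // absolute read; still bounds-checked
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] data = new byte[19]; // two whole words plus a 3-byte tail that is ignored
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i + 1);
        }
        System.out.println(xorWordsShiftMask(data) == xorWordsByteBuffer(data));
    }
}
```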

                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142372#comment-13142372 ] 

Ted Dunning commented on MAHOUT-862:
------------------------------------

{quote}
That's probably because it's an intrinsic reading in 8 bytes at a time without bounds checking, but I'd have to confirm that by looking at the jitted code dump (or in openjdk code). 
{quote}
I imagine that it is an intrinsic and that it has less bounds checking, but I am sure that it is safe from over-run (i.e. has sufficient bounds-checking).
                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142258#comment-13142258 ] 

Dawid Weiss commented on MAHOUT-862:
------------------------------------

Are speed gains on ByteBuffers a result of unsafe underlying buffer accesses? A simple array loop with predictable ends should be optimized pretty much the same way though (boundary checks only, followed by unchecked accesses); I wonder where the gain comes from then?

Also, this could be rewritten to actually make use of Unsafe if it's available -- I bet with larger memory chunks the speed gain would be noticeable.
                

        

[jira] [Updated] (MAHOUT-862) MurmurHash 3.0

Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-862:
-----------------------------------

    Attachment: MAHOUT-862.patch

Here's a patch that adds MurmurHash3.  Tests pass, but I'm not an expert in this stuff, so if someone else wants to take a peek, that would be great.
                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142200#comment-13142200 ] 

Grant Ingersoll commented on MAHOUT-862:
----------------------------------------

Committed revision 1196616.

I'll leave open for a day or two so others can review.
                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142199#comment-13142199 ] 

Grant Ingersoll commented on MAHOUT-862:
----------------------------------------

I accidentally committed this when making a minor change to build-reuters.sh.  Rather than roll back, I'm going to let it stick, and others can simply patch what I put in, as it is new functionality that doesn't change the old functionality.
                

        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142254#comment-13142254 ] 

Ted Dunning commented on MAHOUT-862:
------------------------------------

Just looked at the MH3 code.  It definitely needs the ByteBuffer treatment.  It also needs to have test cases.  Yonik's word is usually good, but I don't like having code without good tests.
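
For illustration only, here is a minimal sketch of what a ByteBuffer-based entry point with checkable values could look like. It is not the code from the attached patch or from Yonik's port. The class name Murmur3Sketch and the method hash32 are hypothetical, and the 32-bit x86 variant is assumed because it is the shortest one to show. The constants are those of the public MurmurHash3 reference code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative sketch only: MurmurHash3 x86_32 over a ByteBuffer (names are hypothetical). */
final class Murmur3Sketch {
  private Murmur3Sketch() {}

  /** Hashes the bytes between the buffer's position and limit without moving either. */
  static int hash32(ByteBuffer data, int seed) {
    // Work on a little-endian duplicate so the caller's position, limit and order stay untouched.
    ByteBuffer buf = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    int length = buf.remaining();
    int h1 = seed;

    // Body: mix one 4-byte block at a time.
    while (buf.remaining() >= 4) {
      h1 ^= mixK1(buf.getInt());
      h1 = Integer.rotateLeft(h1, 13);
      h1 = h1 * 5 + 0xe6546b64;
    }

    // Tail: fold the remaining 0..3 bytes into one partial block, lowest byte first.
    int k1 = 0;
    for (int shift = 0; buf.hasRemaining(); shift += 8) {
      k1 |= (buf.get() & 0xff) << shift;
    }
    h1 ^= mixK1(k1); // a no-op when there is no tail, because mixK1(0) == 0

    // Finalization: mix in the length, then avalanche the bits.
    h1 ^= length;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  private static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    return k1;
  }
}
```

A test along these lines could check small inputs whose results can be worked out by hand (the empty input with seed 0 hashes to 0, and with seed 1 to 0x514E28B7), and also that heap and direct buffers holding the same bytes produce the same hash.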
                


        

[jira] [Commented] (MAHOUT-862) MurmurHash 3.0

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142301#comment-13142301 ] 

Dawid Weiss commented on MAHOUT-862:
------------------------------------

> getLong works really well on ByteBuffers

That's probably because it's an intrinsic reading in 8 bytes at a time without bounds checking, but I'd have to confirm that by looking at the jitted code dump (or in openjdk code). Put at the end of my todo list for tonight ;)
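
To make "reading in 8 bytes at a time" concrete, the following sketch compares a single getLong call on a little-endian ByteBuffer with assembling the same value one byte at a time. The class and method names are hypothetical. It only shows that the two reads produce the same value; it says nothing about what the JIT actually emits or how fast either path is.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative sketch only: two equivalent ways to read a little-endian long. */
final class GetLongSketch {
  private GetLongSketch() {}

  /** One absolute getLong call; the buffer decodes all eight bytes. */
  static long viaGetLong(byte[] data, int offset) {
    return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
  }

  /** The same value built by hand, one byte per loop iteration, lowest byte first. */
  static long viaBytes(byte[] data, int offset) {
    long value = 0;
    for (int i = 0; i < 8; i++) {
      value |= (data[offset + i] & 0xffL) << (8 * i);
    }
    return value;
  }
}
```

Whether the first form really compiles down to a single unchecked 8-byte load is the open question in the comment above and would need the JIT output to confirm.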
                


        

[jira] [Updated] (MAHOUT-862) MurmurHash 3.0

Posted by "Grant Ingersoll (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-862:
-----------------------------------

    Fix Version/s: 0.6
    


        

[jira] [Resolved] (MAHOUT-862) MurmurHash 3.0

Posted by "Grant Ingersoll (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-862.
------------------------------------

    Resolution: Fixed
    
