Posted to dev@mahout.apache.org by Dawid Weiss <da...@cs.put.poznan.pl> on 2011/03/04 12:11:34 UTC

Performance of primitive collections

A while ago we discussed whether it'd make sense to integrate HPPC with
Mahout. I remember Ted mentioned it'd be good to have some kind of
comparative benchmark that simply compares Mahout Math collections against
HPPC. I finally found the time to do it, prompted by some substantial
updates Sebastiano Vigna introduced in fastutil. But first, a disclaimer :)

I am not trying to impose HPPC on anybody and I'm not trying to show which
library is "better" or "worse" in a general sense. I believe the benchmarks
presented below, and any conclusions drawn from them, apply only marginally
to any algorithm complex enough to be doing something other than stressing
a given hash map implementation. I also believe that diversity drives
progress, so it's good to share ;)

So. First, I've been thinking about how to measure hash map performance in
any sensible (read: reasonable) application. Sebastiano suggested some clues
and I came up with some other ideas, but any realistic benchmarks (in the
sense of data or algorithm) would be a welcome extension. I've implemented
the following benchmarks (all on int-int maps):

- BenchmarkPut: injecting a lot of integers in a map, basically what one
would do to get a unique set of keys, for example. The input ints have
different distributions: linear, random and high-bits linear (increments on
higher bits, then on lower bits).

- BenchmarkGetWithRemoved: injecting a lot of integers, then removing a
fraction of them (not benchmarked), and (benchmarked) querying this map for
a partially matching set of keys. A realistic scenario if you have removals
from the map.

- BenchmarkBigramCounting: calculating bigram frequencies from a real corpus
of texts in Polish. The input is 10MB long, so it's not a big data set, but
the texts are realistic and hence so are bigram distributions.
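To make the key distributions concrete, here is one way they could be generated (my interpretation of the descriptions above, not necessarily the benchmark's exact code; in particular, using Integer.reverse for "high-bits linear" is my guess):

```java
import java.util.Random;

// Sketch of the three BenchmarkPut input distributions (my interpretation
// of the descriptions above, not the benchmark's exact code).
public class KeyDistributions {
    public static int[] linear(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = i;      // 0, 1, 2, ...
        return keys;
    }

    public static int[] random(int n, long seed) {
        int[] keys = new int[n];
        Random rnd = new Random(seed);
        for (int i = 0; i < n; i++) keys[i] = rnd.nextInt();
        return keys;
    }

    // "High-bits linear": increments land on the upper bits first, which is
    // hostile to maps that take the low bits of an unhashed key as the slot.
    public static int[] highBits(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = Integer.reverse(i);
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(linear(4)[3] + " " + highBits(4)[1]);
    }
}
```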

I tested the following hash map implementations (int-int primitives):

- fastutil (6.1.0),
- HPPC (trunk, but without improvements resulting from these benchmarks yet,
so I'm not biased ;)),
- Java HashMap (on boxed Integer keys),
- GNU Trove (3.0.0rc1),
- Mahout Math/Collections (1.0).

The bigram counting also includes PCJ and fastutil's linked hash map
implementation for historic reasons (but these are less relevant).

The benchmarks were run on a Windows 7 machine with an i7 CPU (detailed log
attached), using Google Caliper. Times below are in ms. The links should
lead you to on-line charts. Some conclusions:

1) BenchmarkPut,
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkPut

distribution implementation    ms linear runtime
      RANDOM           HPPC  77.9 ==========
      LINEAR           HPPC  79.1 ==========
    HIGHBITS           HPPC  78.4 ==========
      RANDOM       FASTUTIL  97.4 ============
      LINEAR       FASTUTIL 101.2 =============
    HIGHBITS       FASTUTIL 101.3 =============
      RANDOM           JAVA 232.6 ==============================
      LINEAR           JAVA  72.7 =========
    HIGHBITS           JAVA 184.2 =======================
      RANDOM         MAHOUT 125.9 ================
      LINEAR         MAHOUT  60.3 =======
    HIGHBITS         MAHOUT  98.4 ============
      RANDOM          TROVE 113.8 ==============
      LINEAR          TROVE  43.8 =====
    HIGHBITS          TROVE  83.7 ==========

I believe additional good hashing of integer keys helps to even out any
weird input key distributions (and these would occur a lot in practical
situations, I bet). fastutil uses the finalization step from MurmurHash3;
HPPC uses a simplified full pass over the int's 4 bytes from MurmurHash2,
but both yield a comparable effect (the speed difference is a result of
something else). Mahout Math currently doesn't hash ints, so this would
probably be a good addition (at very little overhead).
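For reference, fastutil's mixing step is the 32-bit finalizer ("avalanche" step) from MurmurHash3. It costs only a few shifts and multiplies, yet spreads entropy across all bits of the key:

```java
// The 32-bit finalization step from MurmurHash3, as used by fastutil to
// mix int keys before computing a slot index.
public class Mix {
    public static int mix32(int k) {
        k ^= k >>> 16;
        k *= 0x85ebca6b;
        k ^= k >>> 13;
        k *= 0xc2b2ae35;
        k ^= k >>> 16;
        return k;
    }

    public static void main(String[] args) {
        // Consecutive keys end up far apart after mixing:
        System.out.println(Integer.toHexString(mix32(1)));
        System.out.println(Integer.toHexString(mix32(2)));
    }
}
```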

2) BenchmarkGetWithRemoved:
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved

%rem.keys   implementation     ms linear runtime
          0           HPPC  65.73 ============
          0       FASTUTIL  52.63 ==========
          0           JAVA 156.37 ==============================
          0          TROVE  75.31 ==============
          0         MAHOUT  85.59 ================
       0.25           HPPC  87.19 ================
       0.25       FASTUTIL  58.24 ===========
       0.25           JAVA 149.26 ============================
       0.25          TROVE 123.32 =======================
       0.25         MAHOUT 113.77 =====================
        0.5           HPPC 101.70 ===================
        0.5       FASTUTIL  55.74 ==========
        0.5           JAVA 135.01 =========================
        0.5          TROVE  93.19 =================
        0.5         MAHOUT 125.16 ========================
       0.75           HPPC  93.98 ==================
       0.75       FASTUTIL  42.20 ========
       0.75           JAVA 101.29 ===================
       0.75          TROVE  76.07 ==============
       0.75         MAHOUT  94.67 ==================
          1           HPPC  60.52 ===========
          1       FASTUTIL  10.78 ==
          1           JAVA  44.30 ========
          1          TROVE   9.54 =
          1         MAHOUT  30.98 =====

This shows two things. First, open addressing based on double hashing (with
REMOVED state markers) requires either automatic or manual compaction
(rehashing) to avoid performance degradation. HPPC doesn't do this (it
leaves compaction to the user), so the time grows with the number of
removed keys:

          0           HPPC  65.73 ============
       0.25           HPPC  87.19 ================
        0.5           HPPC 101.70 ===================
       0.75           HPPC  93.98 ==================
          1           HPPC  60.52 ===========

(the drop at 100% of removed keys is probably HotSpot overlearning and
optimizing differently). It's curious why Mahout's implementation was so
much slower than everything else, but I didn't have the time to investigate
yet. The only thing that comes to mind is its use of the modulus operator
(%) for slot computation instead of bitmasking (which restricts map sizes
to 2^n). Also note that even regular Java does a fairly good job here (the
number of keys was 2,000,000, but even for larger inputs the results were
quite similar). Also, note fastutil here:

          0       FASTUTIL  52.63 ==========
       0.25       FASTUTIL  58.24 ===========
        0.5       FASTUTIL  55.74 ==========
       0.75       FASTUTIL  42.20 ========
          1       FASTUTIL  10.78 ==

Nice. Sebastiano reimplemented the hash maps to use linear probing instead
of double hashing, and with linear probing you can avoid REMOVED state
markers entirely (so there is no need to rehash, and you get empty slots
back for reuse when adding new elements). The key here is the combination
of linear probing AND a good key-mangling hash function (theoretically,
with a perfect hash function, the input key distribution wouldn't affect
linear probing performance at all -- no clustering or chaining effects).
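To sketch why REMOVED markers become unnecessary with linear probing, here is a toy int-int map of my own (an illustration, not fastutil's actual code) that removes entries by shifting the rest of the probe cluster back into the freed slot:

```java
import java.util.Arrays;

// Toy int->int map (mine, not fastutil's actual code): linear probing with
// "backward-shift" deletion, so no REMOVED markers are ever needed.
public class ToyIntIntMap {
    // Sentinel for an empty slot; a real implementation tracks occupancy
    // separately instead of reserving a key value. Resizing is also omitted.
    private static final int EMPTY = Integer.MIN_VALUE;
    private final int[] keys = new int[16];
    private final int[] values = new int[16];
    { Arrays.fill(keys, EMPTY); }

    private static int mix(int k) {           // MurmurHash3 32-bit finalizer
        k ^= k >>> 16; k *= 0x85ebca6b;
        k ^= k >>> 13; k *= 0xc2b2ae35;
        return k ^ (k >>> 16);
    }
    private int home(int key) { return mix(key) & (keys.length - 1); }

    public void put(int key, int value) {
        int i = home(key);
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & (keys.length - 1);
        keys[i] = key; values[i] = value;
    }

    public int get(int key, int defaultValue) {
        for (int i = home(key); keys[i] != EMPTY; i = (i + 1) & (keys.length - 1)) {
            if (keys[i] == key) return values[i];
        }
        return defaultValue;
    }

    public void remove(int key) {
        int mask = keys.length - 1;
        int i = home(key);
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & mask;
        if (keys[i] == EMPTY) return;         // key not present
        // Walk the cluster after i; pull back any entry whose home slot is
        // not cyclically inside (i, j], so everything stays reachable.
        for (int j = (i + 1) & mask; keys[j] != EMPTY; j = (j + 1) & mask) {
            int h = home(keys[j]);
            boolean movable = (j > i) ? (h <= i || h > j) : (h <= i && h > j);
            if (movable) {
                keys[i] = keys[j]; values[i] = values[j];
                i = j;
            }
        }
        keys[i] = EMPTY;                      // the freed slot is truly empty
    }

    public static void main(String[] args) {
        ToyIntIntMap m = new ToyIntIntMap();
        m.put(1, 10); m.put(2, 20); m.put(3, 30);
        m.remove(2);
        System.out.println(m.get(1, -1) + " " + m.get(3, -1) + " " + m.get(2, -1));
    }
}
```

After a remove, every slot is either occupied or genuinely empty, so lookups never have to probe past tombstones and no compaction pass is ever needed.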

3) Bigrams,
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting

This is a toy example, I know, but it's the only realistic one in the set,
so it's probably worth mentioning. It is a little bit puzzling to me why
HPPC is (by a small margin) better than fastutil, because it really
shouldn't be. This isn't exactly noise (variance is given in the attached
plain text), but I'd still treat it as a glitch within normal variance.

You can try these on your own machine -- I would actually appreciate it if
you sent your results and feedback back to me or to this list, so that we
can see how consistently the above holds between architectures, processors
and VMs.

git clone git@github.com:carrotsearch/hppc.git
cd hppc-others
(cd lib/; bash ./install-deps )
mvn clean package
java -server -jar hppc-others-0.4.0-SNAPSHOT/hppc-others*.jar | tee benchmark.log

I didn't check the performance of simple lists or other structures with
minimal overhead, because I can't imagine them being significantly faster
or slower in any library. For those, other factors play a role (I
personally like the fact that I can access the internal buffers in HPPC,
but that may also be a point against using it ;).

Dawid

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Is there a point where we just start using HPPC or do we do something it
> doesn't and so the performance tests here are simply a pedagogical exercise
> for those interested in hashing strategies

Hard to tell for sure without a realistic use case, I think.
Microbenchmarks may be misleading; the general cost of whatever is being
computed may swallow minor differences between implementations. Plus, there
*are* minor differences between implementations:

1) resize/rehash strategy is different (prime vs. power-of-two),
2) slot lookup strategy is different (rehash-of-hash + linear lookup vs.
hash + double hashing).
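Schematically, the sizing difference in 1) boils down to how a hash is turned into a slot index (a rough sketch of the two approaches, not either library's actual code):

```java
// Schematic comparison of the two table-sizing strategies mentioned above
// (not either library's actual code).
public class SlotStrategies {
    // Prime-sized table (Mahout/Colt style): modulus spreads keys reasonably
    // even without mixing, but integer division is comparatively slow.
    public static int slotPrime(int hash, int primeCapacity) {
        return (hash & 0x7fffffff) % primeCapacity;
    }

    // Power-of-two table (HPPC/fastutil style): a single AND, but it looks
    // only at the low bits, so the key must be well mixed first.
    public static int slotMasked(int mixedHash, int powerOfTwoCapacity) {
        return mixedHash & (powerOfTwoCapacity - 1);
    }

    public static void main(String[] args) {
        System.out.println(slotPrime(12345, 31));   // 12345 % 31
        System.out.println(slotMasked(12345, 32));  // 12345 & 31
    }
}
```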

If nobody invests the time to actually test over a large(r) body of
applications then I'd say the switch doesn't make much sense at this point.

Dawid

Re: Performance of primitive collections

Posted by Robin Anil <ro...@gmail.com>.
There are a few other things to notice here. This only affects our RASV and
not the others. Higher-order vector ops have tricks to speed them up; we
are talking about a < 10% speedup margin overall for algorithms. This also
says nothing about iteration speed, which is something we care a lot about.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Thu, Jun 6, 2013 at 10:20 AM, Grant Ingersoll <gs...@apache.org>wrote:

> Been following closely all the threads here (and learning a lot).  Is
> there a point where we just start using HPPC or do we do something it
> doesn't and so the performance tests here are simply a pedagogical exercise
> for those interested in hashing strategies b/c there is no way we can
> switch or there is no practical reason to switch?
>
> (ducks)
>
> Has anyone hooked up a profiler to Mahout and run the tests below to see
> where our bottlenecks are?  Robin seemed to indicate to me that we should
> be as fast the other day.
>
> -Grant
>
> On Jun 6, 2013, at 3:22 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > I also did a small experiment myself, not conclusive but putting here for
> > the sake of record.
> >
> > adjustOrPutValue() was calling put if it didn't find the entry in the
> > table, which re-did the hash lookup, it turns out that reducing that
> extra
> > lookup didn't help the benchmarks significantly. (1.4 drop is not
> > significant)
> >
> >
> > Before
> >        library    ms linear runtime
> >           HPPC  85.5 =================
> >          TROVE 112.9 =======================
> >  FASTUTIL_OPEN  65.3 =============
> > FASTUTIL_LINKED  61.5 ============
> >         MAHOUT 143.4 ==============================
> >
> >
> > After
> >
> >        library    ms linear runtime
> >           HPPC  85.3 ==================
> >          TROVE 111.1 =======================
> >  FASTUTIL_OPEN  65.0 =============
> > FASTUTIL_LINKED  61.2 ============
> >         MAHOUT 142.0 ==============================
> >
> >
> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >
> >
> > On Thu, Jun 6, 2013 at 9:05 AM, Dawid Weiss <
> dawid.weiss@cs.put.poznan.pl>wrote:
> >
> >>> Yes I thought this was why you could largely get away with it. If the
> >>> table is big enough, collisions and chains should not be a big deal to
> >>> begin with.
> >>
> >> I admit I didn't think it could be double hashing behind so I assumed
> 2^n-1
> >> -- sorry for not checking before. Good to be able to compare alternate
> >> implementations (the upside of prime-based sizes is that they grow
> somewhat
> >> less aggressively).
> >>
> >>> Since you have this benchmark readily available -- what happens if you
> >>> un-comment one or both of the lines of code that mixes the hash? That
> >>> would really answer it.
> >>
> >> I've ran the same set of benchmarks substituting MurmurHash3's last
> >> avalanche step in HashOperations. The results are, in short, that it
> does
> >> get a tad bit slower on the average, but is (predictably) more stable --
> >>
> >> BenchmarkContainsWithRemoved (2000000 calls):
> >> Before:
> >> removedKeys    ms linear runtime
> >>          0  95.6 =====================
> >>       0.25 116.2 ==========================
> >>        0.5 131.8 ==============================
> >>       0.75  78.3 =================
> >>          1  12.9 ==
> >> After
> >> removedKeys    ms linear runtime
> >>          0 101.4 =====================
> >>       0.25 123.6 ==========================
> >>        0.5 138.6 ==============================
> >>       0.75  82.0 =================
> >>          1  14.7 ===
> >>
> >> BenchmarkPut (1000000 calls):
> >> Before:
> >> distribution   ms linear runtime
> >>      RANDOM 61.8 ==============================
> >>      LINEAR 20.9 ==========
> >>    HIGHBITS 54.8 ==========================
> >> After:
> >> distribution   ms linear runtime
> >>      RANDOM 64.9 =============================
> >>      LINEAR 65.2 =============================
> >>    HIGHBITS 65.7 ==============================
> >>
> >> So you're right -- I don't think it makes sense to add that hashing
> unless
> >> you can devise a worst-case sequence that will break double hashing (I
> >> can't). Note, however, that these times are generally slower than with
> >> 2^n-1 implementations with hashing and linear probing; this seems clear
> >> taking into account CPU caches and the fact that less (and simpler) math
> >> needs to be done for every key. The advantage I see in double hashing is
> >> being able to devise a prime sequence that doesn't grow as fast as 2^n.
> >>
> >> So the conclusion: no need for hashing. Which makes me thing why bother
> to
> >> do this then:
> >>
> >>  public static int hash(float value) {
> >>    return Float.floatToIntBits(value * 663608941.737f);
> >>
> >> or this, for that matter:
> >>
> >>  public static int hash(long value) {
> >>    return (int) (value ^ (value >> 32));
> >>
> >> Note all bits will cancel out to zero if your input keys are (value <<
> 32 |
> >> value). If you remember I mentioned guarding against the worst case -- a
> >> random hash seems to be less probable then this (run it):
> >>
> >>        OpenLongIntHashMap map = new OpenLongIntHashMap();
> >>        for (int i = 0; i < 1000000; i++) {
> >>            long k = (long) i << 32 | i;
> >>            map.put(k, i);
> >>        }
> >>
> >> Dawid
> >>
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
>
>
>
>
>

Re: Performance of primitive collections

Posted by Grant Ingersoll <gs...@apache.org>.
Been following closely all the threads here (and learning a lot).  Is there a point where we just start using HPPC or do we do something it doesn't and so the performance tests here are simply a pedagogical exercise for those interested in hashing strategies b/c there is no way we can switch or there is no practical reason to switch?

(ducks)

Has anyone hooked up a profiler to Mahout and run the tests below to see where our bottlenecks are?  Robin seemed to indicate to me the other day that we should be as fast.

-Grant

On Jun 6, 2013, at 3:22 PM, Robin Anil <ro...@gmail.com> wrote:

> I also did a small experiment myself, not conclusive but putting here for
> the sake of record.
> 
> adjustOrPutValue() was calling put if it didn't find the entry in the
> table, which re-did the hash lookup, it turns out that reducing that extra
> lookup didn't help the benchmarks significantly. (1.4 drop is not
> significant)
> 
> 
> Before
>        library    ms linear runtime
>           HPPC  85.5 =================
>          TROVE 112.9 =======================
>  FASTUTIL_OPEN  65.3 =============
> FASTUTIL_LINKED  61.5 ============
>         MAHOUT 143.4 ==============================
> 
> 
> After
> 
>        library    ms linear runtime
>           HPPC  85.3 ==================
>          TROVE 111.1 =======================
>  FASTUTIL_OPEN  65.0 =============
> FASTUTIL_LINKED  61.2 ============
>         MAHOUT 142.0 ==============================
> 
> 
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> 
> 
> On Thu, Jun 6, 2013 at 9:05 AM, Dawid Weiss <da...@cs.put.poznan.pl>wrote:
> 
>>> Yes I thought this was why you could largely get away with it. If the
>>> table is big enough, collisions and chains should not be a big deal to
>>> begin with.
>> 
>> I admit I didn't think it could be double hashing behind so I assumed 2^n-1
>> -- sorry for not checking before. Good to be able to compare alternate
>> implementations (the upside of prime-based sizes is that they grow somewhat
>> less aggressively).
>> 
>>> Since you have this benchmark readily available -- what happens if you
>>> un-comment one or both of the lines of code that mixes the hash? That
>>> would really answer it.
>> 
>> I've ran the same set of benchmarks substituting MurmurHash3's last
>> avalanche step in HashOperations. The results are, in short, that it does
>> get a tad bit slower on the average, but is (predictably) more stable --
>> 
>> BenchmarkContainsWithRemoved (2000000 calls):
>> Before:
>> removedKeys    ms linear runtime
>>          0  95.6 =====================
>>       0.25 116.2 ==========================
>>        0.5 131.8 ==============================
>>       0.75  78.3 =================
>>          1  12.9 ==
>> After
>> removedKeys    ms linear runtime
>>          0 101.4 =====================
>>       0.25 123.6 ==========================
>>        0.5 138.6 ==============================
>>       0.75  82.0 =================
>>          1  14.7 ===
>> 
>> BenchmarkPut (1000000 calls):
>> Before:
>> distribution   ms linear runtime
>>      RANDOM 61.8 ==============================
>>      LINEAR 20.9 ==========
>>    HIGHBITS 54.8 ==========================
>> After:
>> distribution   ms linear runtime
>>      RANDOM 64.9 =============================
>>      LINEAR 65.2 =============================
>>    HIGHBITS 65.7 ==============================
>> 
>> So you're right -- I don't think it makes sense to add that hashing unless
>> you can devise a worst-case sequence that will break double hashing (I
>> can't). Note, however, that these times are generally slower than with
>> 2^n-1 implementations with hashing and linear probing; this seems clear
>> taking into account CPU caches and the fact that less (and simpler) math
>> needs to be done for every key. The advantage I see in double hashing is
>> being able to devise a prime sequence that doesn't grow as fast as 2^n.
>> 
>> So the conclusion: no need for hashing. Which makes me thing why bother to
>> do this then:
>> 
>>  public static int hash(float value) {
>>    return Float.floatToIntBits(value * 663608941.737f);
>> 
>> or this, for that matter:
>> 
>>  public static int hash(long value) {
>>    return (int) (value ^ (value >> 32));
>> 
>> Note all bits will cancel out to zero if your input keys are (value << 32 |
>> value). If you remember I mentioned guarding against the worst case -- a
>> random hash seems to be less probable then this (run it):
>> 
>>        OpenLongIntHashMap map = new OpenLongIntHashMap();
>>        for (int i = 0; i < 1000000; i++) {
>>            long k = (long) i << 32 | i;
>>            map.put(k, i);
>>        }
>> 
>> Dawid
>> 

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com






Re: Performance of primitive collections

Posted by Robin Anil <ro...@gmail.com>.
I also did a small experiment myself, not conclusive but putting here for
the sake of record.

adjustOrPutValue() was calling put() if it didn't find the entry in the
table, which re-did the hash lookup. It turns out that removing that extra
lookup didn't help the benchmarks significantly (the 1.4 ms drop is not
significant).


Before
        library    ms linear runtime
           HPPC  85.5 =================
          TROVE 112.9 =======================
  FASTUTIL_OPEN  65.3 =============
FASTUTIL_LINKED  61.5 ============
         MAHOUT 143.4 ==============================


After

        library    ms linear runtime
           HPPC  85.3 ==================
          TROVE 111.1 =======================
  FASTUTIL_OPEN  65.0 =============
FASTUTIL_LINKED  61.2 ============
         MAHOUT 142.0 ==============================


Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Thu, Jun 6, 2013 at 9:05 AM, Dawid Weiss <da...@cs.put.poznan.pl>wrote:

> > Yes I thought this was why you could largely get away with it. If the
> > table is big enough, collisions and chains should not be a big deal to
> > begin with.
>
> I admit I didn't think it could be double hashing behind so I assumed 2^n-1
> -- sorry for not checking before. Good to be able to compare alternate
> implementations (the upside of prime-based sizes is that they grow somewhat
> less aggressively).
>
> > Since you have this benchmark readily available -- what happens if you
> > un-comment one or both of the lines of code that mixes the hash? That
> > would really answer it.
>
> I've ran the same set of benchmarks substituting MurmurHash3's last
> avalanche step in HashOperations. The results are, in short, that it does
> get a tad bit slower on the average, but is (predictably) more stable --
>
> BenchmarkContainsWithRemoved (2000000 calls):
> Before:
> removedKeys    ms linear runtime
>           0  95.6 =====================
>        0.25 116.2 ==========================
>         0.5 131.8 ==============================
>        0.75  78.3 =================
>           1  12.9 ==
> After
> removedKeys    ms linear runtime
>           0 101.4 =====================
>        0.25 123.6 ==========================
>         0.5 138.6 ==============================
>        0.75  82.0 =================
>           1  14.7 ===
>
> BenchmarkPut (1000000 calls):
> Before:
> distribution   ms linear runtime
>       RANDOM 61.8 ==============================
>       LINEAR 20.9 ==========
>     HIGHBITS 54.8 ==========================
> After:
> distribution   ms linear runtime
>       RANDOM 64.9 =============================
>       LINEAR 65.2 =============================
>     HIGHBITS 65.7 ==============================
>
> So you're right -- I don't think it makes sense to add that hashing unless
> you can devise a worst-case sequence that will break double hashing (I
> can't). Note, however, that these times are generally slower than with
> 2^n-1 implementations with hashing and linear probing; this seems clear
> taking into account CPU caches and the fact that less (and simpler) math
> needs to be done for every key. The advantage I see in double hashing is
> being able to devise a prime sequence that doesn't grow as fast as 2^n.
>
> So the conclusion: no need for hashing. Which makes me thing why bother to
> do this then:
>
>   public static int hash(float value) {
>     return Float.floatToIntBits(value * 663608941.737f);
>
> or this, for that matter:
>
>   public static int hash(long value) {
>     return (int) (value ^ (value >> 32));
>
> Note all bits will cancel out to zero if your input keys are (value << 32 |
> value). If you remember I mentioned guarding against the worst case -- a
> random hash seems to be less probable then this (run it):
>
>         OpenLongIntHashMap map = new OpenLongIntHashMap();
>         for (int i = 0; i < 1000000; i++) {
>             long k = (long) i << 32 | i;
>             map.put(k, i);
>         }
>
> Dawid
>

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Yes I thought this was why you could largely get away with it. If the
> table is big enough, collisions and chains should not be a big deal to
> begin with.

I admit I didn't think it could be double hashing behind the scenes, so I
assumed 2^n-1 sizing -- sorry for not checking before. It's good to be able
to compare alternate implementations (the upside of prime-based sizes is
that they grow somewhat less aggressively).

> Since you have this benchmark readily available -- what happens if you
> un-comment one or both of the lines of code that mixes the hash? That
> would really answer it.

I've re-run the same set of benchmarks, substituting MurmurHash3's final
avalanche step into HashOperations. The results, in short: it does get a
tad slower on average, but is (predictably) more stable --

BenchmarkContainsWithRemoved (2000000 calls):
Before:
removedKeys    ms linear runtime
          0  95.6 =====================
       0.25 116.2 ==========================
        0.5 131.8 ==============================
       0.75  78.3 =================
          1  12.9 ==
After
removedKeys    ms linear runtime
          0 101.4 =====================
       0.25 123.6 ==========================
        0.5 138.6 ==============================
       0.75  82.0 =================
          1  14.7 ===

BenchmarkPut (1000000 calls):
Before:
distribution   ms linear runtime
      RANDOM 61.8 ==============================
      LINEAR 20.9 ==========
    HIGHBITS 54.8 ==========================
After:
distribution   ms linear runtime
      RANDOM 64.9 =============================
      LINEAR 65.2 =============================
    HIGHBITS 65.7 ==============================

So you're right -- I don't think it makes sense to add that hashing unless
you can devise a worst-case sequence that will break double hashing (I
can't). Note, however, that these times are generally slower than with
2^n-1 implementations with hashing and linear probing; this seems clear
taking into account CPU caches and the fact that less (and simpler) math
needs to be done for every key. The advantage I see in double hashing is
being able to devise a prime sequence that doesn't grow as fast as 2^n.

So the conclusion: no need for hashing. Which makes me think: why bother
to do this, then:

  public static int hash(float value) {
    return Float.floatToIntBits(value * 663608941.737f);

or this, for that matter:

  public static int hash(long value) {
    return (int) (value ^ (value >> 32));

Note that all bits cancel out to zero if your input keys are of the form
(value << 32 | value). If you remember, I mentioned guarding against the
worst case -- a random worst-case hash seems less probable than this one
(run it):

        OpenLongIntHashMap map = new OpenLongIntHashMap();
        for (int i = 0; i < 1000000; i++) {
            long k = (long) i << 32 | i;
            map.put(k, i);
        }
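Running that loop is indeed degenerate: every key of the form (i << 32 | i) collapses to hash zero, so all million puts land in a single collision chain. A stripped-down check:

```java
// Demonstrates the degenerate case of hash(long) = (int)(value ^ (value >> 32)):
// every key of the form (i << 32 | i) hashes to 0.
public class LongHashCollision {
    public static int hash(long value) {
        return (int) (value ^ (value >> 32));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            long k = (long) i << 32 | i;
            // low and high halves are identical, so the XOR cancels to 0
            System.out.println(k + " -> " + hash(k));
        }
    }
}
```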

Dawid

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Never mind. I found the checked in jar in the lib folder.

Yep. Is there a snapshot repo which gets updates nightly or something? I
didn't find one.

Dawid

Re: Performance of primitive collections

Posted by Robin Anil <ro...@gmail.com>.
Never mind. I found the checked in jar in the lib folder.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Thu, Jun 6, 2013 at 3:13 AM, Robin Anil <ro...@gmail.com> wrote:

> Dawid, could you explain how this is accessing the mahout trunk code? Is
> that part of the package step?
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Wed, Jun 5, 2013 at 8:20 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> On Wed, Jun 5, 2013 at 7:00 PM, Dawid Weiss
>> <da...@cs.put.poznan.pl> wrote:
>> > 1) if you know your keys are 0...N-1 (dense) then there is typically no
>> > need to use a map; you'll do just as well with a regular lookup array :)
>>
>> Yeah of course... here I think there is no guarantee of this although
>> in many practical use cases the keys are drawn from a small range. For
>> example 1000 user IDs are often around the range 1 - 1000 or
>> something.
>>
>>
>> > Having written the above it made me wonder why Mahout's performance does
>> > not drop significantly on those malicious upper-bit changing
>> benchmarks, so
>> > I checked the code. The thing is Mahout (and I think Colt, originally)
>> uses
>> > double hashing with prime-sized tables (not 2^n-1 masking) -- this is
>> nice
>> > and indeed avoids the problems of clustering mentioned in pt. 2 above.
>>
>> Yes I thought this was why you could largely get away with it. If the
>> table is big enough, collisions and chains should not be a big deal to
>> begin with.
>>
>> Since you have this benchmark readily available -- what happens if you
>> un-comment one or both of the lines of code that mixes the hash? That
>> would really answer it.
>>
>> i'm curious as I have certainly made this assumption independently in
>> the custom hashtable implementations I made long ago (which also use
>> double hashing) and still use in the project. If it's not optimal
>> that's important info.
>>
>
>

Re: Performance of primitive collections

Posted by Robin Anil <ro...@gmail.com>.
Dawid, could you explain how this is accessing the mahout trunk code? Is
that part of the package step?

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Wed, Jun 5, 2013 at 8:20 PM, Sean Owen <sr...@gmail.com> wrote:

> On Wed, Jun 5, 2013 at 7:00 PM, Dawid Weiss
> <da...@cs.put.poznan.pl> wrote:
> > 1) if you know your keys are 0...N-1 (dense) then there is typically no
> > need to use a map; you'll do just as well with a regular lookup array :)
>
> Yeah of course... here I think there is no guarantee of this although
> in many practical use cases the keys are drawn from a small range. For
> example 1000 user IDs are often around the range 1 - 1000 or
> something.
>
>
> > Having written the above it made me wonder why Mahout's performance does
> > not drop significantly on those malicious upper-bit changing benchmarks,
> so
> > I checked the code. The thing is Mahout (and I think Colt, originally)
> uses
> > double hashing with prime-sized tables (not 2^n-1 masking) -- this is
> nice
> > and indeed avoids the problems of clustering mentioned in pt. 2 above.
>
> Yes I thought this was why you could largely get away with it. If the
> table is big enough, collisions and chains should not be a big deal to
> begin with.
>
> Since you have this benchmark readily available -- what happens if you
> un-comment one or both of the lines of code that mixes the hash? That
> would really answer it.
>
> i'm curious as I have certainly made this assumption independently in
> the custom hashtable implementations I made long ago (which also use
> double hashing) and still use in the project. If it's not optimal
> that's important info.
>

Re: Performance of primitive collections

Posted by Sean Owen <sr...@gmail.com>.
On Wed, Jun 5, 2013 at 7:00 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
> 1) if you know your keys are 0...N-1 (dense) then there is typically no
> need to use a map; you'll do just as well with a regular lookup array :)

Yeah of course... here I think there is no guarantee of this although
in many practical use cases the keys are drawn from a small range. For
example 1000 user IDs are often around the range 1 - 1000 or
something.


> Having written the above it made me wonder why Mahout's performance does
> not drop significantly on those malicious upper-bit changing benchmarks, so
> I checked the code. The thing is Mahout (and I think Colt, originally) uses
> double hashing with prime-sized tables (not 2^n-1 masking) -- this is nice
> and indeed avoids the problems of clustering mentioned in pt. 2 above.

Yes I thought this was why you could largely get away with it. If the
table is big enough, collisions and chains should not be a big deal to
begin with.

Since you have this benchmark readily available -- what happens if you
un-comment one or both of the lines of code that mixes the hash? That
would really answer it.

I'm curious as I have certainly made this assumption independently in
the custom hashtable implementations I made long ago (which also use
double hashing) and still use in the project. If it's not optimal
that's important info.

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi Sean,

> I don't know the history of that method, but I think there are reasons
> why no mixing could be beneficial. I don't know if it ends up being
> true in practice. For example, if you frequently store keys with
> values 0..N-1 then in a big-enough map, and maybe even often access
> the keys in order, then this tends to give you dense and linear access
> to the table.

There are a few reasons. Theoretical/motivational:

1) if you know your keys are 0...N-1 (dense) then there is typically no
need to use a map; you'll do just as well with a regular lookup array :)

2) if your keys are sparsely distributed in 0...N-1 then there is a threat
that you'll hit long chains of collisions for certain slot lookup
implementations (linear or quadratic probing). This is in particular the
case when the key is a compound of some sort of shifted values; like a long
created from two integers where x = a << 32 | b. This isn't as uncommon as
it may seem because it's often more convenient to concatenate several values
into a single primitive than to pay the overhead of an indirect lookup or a
full object.

And then there are practical/own-experience considerations.

1) I thought no-rehashing would be fine in the early days of HPPC. It hurt
me badly in terms of performance in real life applications (see pt. 2
above).
2) in practice rehashing is very cheap -- if you look at the benchmarks I
posted recently you'll see that it adds almost no noticeable overhead
(everything is within L1 cache or even inside registers).

Having written the above it made me wonder why Mahout's performance does
not drop significantly on those malicious upper-bit changing benchmarks, so
I checked the code. The thing is Mahout (and I think Colt, originally) uses
double hashing with prime-sized tables (not 2^n-1 masking) -- this is nice
and indeed avoids the problems of clustering mentioned in pt. 2 above.

So maybe you're right and this indeed isn't an issue given double hashing
is the default slot-lookup technique. I'll leave it up to you to either
close the issue as invalid or consider making the change nonetheless.

Dawid
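The double-hashing slot lookup described above can be sketched as follows. This is an illustrative toy, not the actual Colt/Mahout code; the class, method names and the capacity constant are made up for the example:

```java
// Illustrative sketch of double hashing over a prime-sized table, the
// Colt/Mahout-style slot lookup discussed above. Not the actual Mahout code.
public class DoubleHashingDemo {
    static final int CAPACITY = 127;  // a prime table size, not 2^n

    // Probe: start at h1(key), step by h2(key). With a prime capacity the
    // step (1..CAPACITY-1) is co-prime to it, so the probe sequence visits
    // every slot; different keys get different steps, which breaks up the
    // collision clusters that plain linear probing would form.
    static int findSlot(int[] keys, boolean[] used, int key) {
        int v = key & 0x7FFFFFFF;               // make it >= 0
        int slot = v % CAPACITY;
        int step = 1 + (v % (CAPACITY - 2));
        while (used[slot] && keys[slot] != key) {
            slot -= step;
            if (slot < 0) slot += CAPACITY;
        }
        return slot;
    }

    static void put(int[] keys, boolean[] used, int key) {
        int slot = findSlot(keys, used, key);
        used[slot] = true;
        keys[slot] = key;
    }

    public static void main(String[] args) {
        int[] keys = new int[CAPACITY];
        boolean[] used = new boolean[CAPACITY];
        // High-bit-incrementing keys, unmixed: double hashing still places
        // each of the 100 distinct keys in its own slot.
        for (int i = 0; i < 100; i++) {
            put(keys, used, i << 16);
        }
        int occupied = 0;
        for (boolean u : used) {
            if (u) occupied++;
        }
        System.out.println(occupied + " occupied slots for 100 distinct keys");
    }
}
```

Because the probe step depends on the key, two keys that collide on the first slot usually diverge on the next probe, which is why the identity hash gets away without mixing here.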

> If the map is big enough that collisions are rare, then the negative
> side effects you get from chaining are small and rare.
>
> I haven't thought deeply about it and don't have evidence it's better
> or worse but there may be a reason.
>
> On Wed, Jun 5, 2013 at 1:49 PM, Dawid Weiss
> <da...@cs.put.poznan.pl> wrote:
>>> But that's absolutely weird. The mixing function should take care of that.
>>
>> Sure, unless you don't have any... This is what's currently in Mahout
>> (look closely at the first line!):
>>
>>   public static int hash(int value) {
>>     return value;
>>
>>     //return value * 0x278DDE6D; // see
>> org.apache.mahout.math.jet.random.engine.DRand
>>
>>     /*
>>     value &= 0x7FFFFFFF; // make it >=0
>>     int hashCode = 0;
>>     do hashCode = 31*hashCode + value%10;
>>     while ((value /= 10) > 0);
>>
>>     return 28629151*hashCode; // spread even further; h*31^5
>>     */
>>   }
>>
>> So there is no redistributing of keys. In my opinion this is a bug
>> that should be addressed.
>>
>> Dawid
>

Re: Performance of primitive collections

Posted by Sean Owen <sr...@gmail.com>.
I don't know the history of that method, but I think there are reasons
why no mixing could be beneficial. I don't know if it ends up being
true in practice. For example, if you frequently store keys with
values 0..N-1 then in a big-enough map, and maybe even often access
the keys in order, then this tends to give you dense and linear access
to the table.

If the map is big enough that collisions are rare, then the negative
side effects you get from chaining are small and rare.

I haven't thought deeply about it and don't have evidence it's better
or worse but there may be a reason.

On Wed, Jun 5, 2013 at 1:49 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
>> But that's absolutely weird. The mixing function should take care of that.
>
> Sure, unless you don't have any... This is what's currently in Mahout
> (look closely at the first line!):
>
>   public static int hash(int value) {
>     return value;
>
>     //return value * 0x278DDE6D; // see
> org.apache.mahout.math.jet.random.engine.DRand
>
>     /*
>     value &= 0x7FFFFFFF; // make it >=0
>     int hashCode = 0;
>     do hashCode = 31*hashCode + value%10;
>     while ((value /= 10) > 0);
>
>     return 28629151*hashCode; // spread even further; h*31^5
>     */
>   }
>
> So there is no redistributing of keys. In my opinion this is a bug
> that should be addressed.
>
> Dawid

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> But that's absolutely weird. The mixing function should take care of that.

Sure, unless you don't have any... This is what's currently in Mahout
(look closely at the first line!):

  public static int hash(int value) {
    return value;

    //return value * 0x278DDE6D; // see
org.apache.mahout.math.jet.random.engine.DRand

    /*
    value &= 0x7FFFFFFF; // make it >=0
    int hashCode = 0;
    do hashCode = 31*hashCode + value%10;
    while ((value /= 10) > 0);

    return 28629151*hashCode; // spread even further; h*31^5
    */
  }

So there is no redistributing of keys. In my opinion this is a bug
that should be addressed.

Dawid
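For comparison, the kind of mixing discussed in this thread is the MurmurHash3 finalization step that fastutil applies to keys. The sketch below is illustrative only (not Mahout or HPPC code; names are made up): it shows what the identity hash above costs on a 2^n-masked table fed high-bit-incrementing keys.

```java
// Illustrative only -- shows how an identity hash collapses
// high-bit-incrementing keys onto one slot under 2^n masking,
// while a MurmurHash3-style finalizer spreads them out.
import java.util.HashSet;
import java.util.Set;

public class MixingDemo {

    // MurmurHash3's 32-bit finalization step ("fmix32").
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Count the distinct slots hit by 1000 high-bit-incrementing keys
    // in a 1024-slot (2^10) masked table, with or without mixing.
    static int distinctSlots(boolean mixKeys) {
        final int mask = 1024 - 1;
        Set<Integer> slots = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            int key = i << 16;  // increments on the high bits only
            slots.add((mixKeys ? mix(key) : key) & mask);
        }
        return slots.size();
    }

    public static void main(String[] args) {
        // Unmixed, every key's low 10 bits are zero, so all 1000 keys
        // land on slot 0; mixed, they spread over most of the table.
        System.out.println("identity hash: " + distinctSlots(false) + " slot(s)");
        System.out.println("mixed hash:    " + distinctSlots(true) + " slot(s)");
    }
}
```

This is the pathological case for masked tables; as noted elsewhere in the thread, double hashing over prime-sized tables dodges it even without mixing.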

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi folks,

I re-ran my benchmarks with the latest and greatest from Mahout Math,
fastutil, HPPC and Trove. Mind you, these results will vary depending on the
CPU, architecture, JVM version etc... so no flame wars. If you want to
repeat them in your own setting then:

git clone git@github.com:carrotsearch/hppc.git
cd hppc
mvn package
cd hppc-benchmarks/target/hppc-benchmarks-0.6.0-SNAPSHOT
java -jar hppc-benchmarks-0.6.0-SNAPSHOT.jar

The results show what we already established once -- HPPC and fastutil are
typically faster and more "stable" with regard to input type. I think these
differences are due to murmur hashing being applied to keys and to linear,
instead of quadratic, probing.

1) Simple bigram counting (Polish texts).

{code}
================================================================================
------------------------ BenchmarkBigramCounting (1/1) -------------------------
================================================================================
        library    ms linear runtime
           HPPC  67.3 =================
          TROVE  99.1 =========================
  FASTUTIL_OPEN  52.3 =============
FASTUTIL_LINKED  48.2 ============
         MAHOUT 118.3 ==============================
{code}
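The bigram-counting benchmark above boils down to something like the following sketch. It is illustrative only: the real harness uses each library's primitive int-int map, whereas this version uses java.util.HashMap; the trick of packing each character pair into one int key is what makes an int-int map applicable at all.

```java
// Illustrative sketch of bigram counting over int-int map keys: each
// adjacent character pair is packed into a single int. With a primitive
// int-int map (HPPC, fastutil, Trove, Mahout) this avoids boxing entirely;
// java.util.HashMap is used here only to keep the example self-contained.
import java.util.HashMap;
import java.util.Map;

public class BigramCount {
    static Map<Integer, Integer> bigrams(String text) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            // Pack two 16-bit chars into one int key (high char, low char).
            int key = (text.charAt(i) << 16) | text.charAt(i + 1);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> c = bigrams("abab");
        System.out.println(c.get(('a' << 16) | 'b'));  // "ab" occurs twice
    }
}
```

Note that these packed keys are exactly the "compound value" shape discussed earlier in the thread -- differences concentrated in particular bit ranges -- which is why key mixing matters for this benchmark.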

2) contains() with removed entries.

{code}
================================================================================
---------------------- BenchmarkContainsWithRemoved (1/1) ----------------------
================================================================================

%removedKeys implementation     ms linear runtime
           0           HPPC  43.55 =========
           0       FASTUTIL  43.08 =========
           0           JAVA 110.80 =========================
           0          TROVE  85.32 ===================
           0         MAHOUT  94.36 =====================

        0.25           HPPC  46.27 ==========
        0.25       FASTUTIL  44.21 ==========
        0.25           JAVA 113.25 =========================
        0.25          TROVE 116.77 ==========================
        0.25         MAHOUT 116.16 ==========================

         0.5           HPPC  45.85 ==========
         0.5       FASTUTIL  43.75 =========
         0.5           JAVA  97.21 ======================
         0.5          TROVE  85.82 ===================
         0.5         MAHOUT 131.44 ==============================

        0.75           HPPC  31.92 =======
        0.75       FASTUTIL  33.57 =======
        0.75           JAVA  71.20 ================
        0.75          TROVE  70.73 ================
        0.75         MAHOUT  77.89 =================

           1           HPPC   7.97 =
           1       FASTUTIL   9.39 ==
           1           JAVA  31.38 =======
           1          TROVE   6.31 =
           1         MAHOUT  12.90 ==
{code}

3) put() with different input distributions.

{code}
================================================================================
------------------------------ BenchmarkPut (1/1) ------------------------------
================================================================================

distribution implementation    ms linear runtime
      RANDOM           HPPC  46.6 =======
      RANDOM       FASTUTIL  47.1 =======
      RANDOM           JAVA 181.8 ==============================
      RANDOM          TROVE  78.4 ============
      RANDOM         MAHOUT  61.8 ==========

      LINEAR           HPPC  46.7 =======
      LINEAR       FASTUTIL  47.2 =======
      LINEAR           JAVA  30.5 =====
      LINEAR          TROVE  15.9 ==
      LINEAR         MAHOUT  20.9 ===

    HIGHBITS           HPPC  46.7 =======
    HIGHBITS       FASTUTIL  47.2 =======
    HIGHBITS           JAVA 142.7 =======================
    HIGHBITS          TROVE  60.3 =========
    HIGHBITS         MAHOUT  55.0 =========
{code}

Cheers,
Dawid
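The contains()-with-removed numbers above track each library's deletion strategy. Below is a minimal sketch of linear probing with backward-shift deletion -- the tombstone-free removal that linear-addressing maps such as fastutil's can use. It is illustrative code under simplifying assumptions (fixed-size table, keys only, no resizing), not fastutil's actual implementation:

```java
// Illustrative sketch (not fastutil's actual code) of linear probing with
// backward-shift deletion: on remove(), later entries of the same probe
// chain are shifted back into the hole, so no REMOVED tombstones are left
// behind and lookups do not degrade after deletions.
public class LinearProbeDemo {
    static final int MASK = 63;                 // 64-slot (2^6) table
    final int[] keys = new int[MASK + 1];
    final boolean[] used = new boolean[MASK + 1];

    // MurmurHash3 finalizer -- the key mixing discussed in this thread.
    static int mix(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16; return h;
    }

    void put(int key) {
        int slot = mix(key) & MASK;
        while (used[slot] && keys[slot] != key) slot = (slot + 1) & MASK;
        used[slot] = true;
        keys[slot] = key;
    }

    boolean contains(int key) {
        int slot = mix(key) & MASK;
        while (used[slot]) {
            if (keys[slot] == key) return true;
            slot = (slot + 1) & MASK;
        }
        return false;                           // probing stops at an empty slot
    }

    void remove(int key) {
        int slot = mix(key) & MASK;
        while (used[slot] && keys[slot] != key) slot = (slot + 1) & MASK;
        if (!used[slot]) return;                // key not present
        int hole = slot;
        int next = (hole + 1) & MASK;
        while (used[next]) {
            int home = mix(keys[next]) & MASK;
            // Shift keys[next] back iff the hole lies on its probe path,
            // i.e. home is NOT cyclically within (hole, next].
            boolean shift = (next > hole) ? (home <= hole || home > next)
                                          : (home <= hole && home > next);
            if (shift) {
                keys[hole] = keys[next];
                hole = next;
            }
            next = (next + 1) & MASK;
        }
        used[hole] = false;                     // the final hole becomes empty
    }

    public static void main(String[] args) {
        LinearProbeDemo map = new LinearProbeDemo();
        for (int k = 1; k <= 20; k++) map.put(k);
        map.remove(5);
        System.out.println(map.contains(5));    // false: gone, no tombstone
        System.out.println(map.contains(7));    // true: chain repaired by shifting
    }
}
```

The shift condition is the whole trick: an entry may only move into the hole if the hole still lies on the probe path from that entry's home slot, otherwise a later lookup would stop at an empty slot before reaching it.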

Re: Performance of primitive collections

Posted by Ted Dunning <te...@gmail.com>.
Master/trunk is the place to test.


On Mon, Jun 3, 2013 at 7:32 AM, Dawid Weiss <da...@cs.put.poznan.pl> wrote:

> > Dawid, do you have your existing benchmark code against
> > fastutil/hppc/trove. Since the performance improvements we made in last
> > couple of months I am itching to revisit the numbers.
>
> I can rerun those that I have -- will let you know! Should I use the
> master branch or the official latest release?
>
> Dawid
>

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Dawid, do you have your existing benchmark code against
> fastutil/hppc/trove. Since the performance improvements we made in last
> couple of months I am itching to revisit the numbers.

I can rerun those that I have -- will let you know! Should I use the
master branch or the official latest release?

Dawid

Re: Performance of primitive collections

Posted by Robin Anil <ro...@gmail.com>.
There have been some improvements in the hashmaps, so I would like to re-run
these tests.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Jun 3, 2013 at 5:46 AM, Sebastiano Vigna <vi...@di.unimi.it> wrote:

> On 3 Jun 2013, at 12:43 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > Dawid, do you have your existing benchmark code against
> fastutil/hppc/trove. Since the performance improvements we made in last
> couple of months I am itching to revisit the numbers.
>
> If you're interested, they did a thorough job here:
>
> http://blog.aggregateknowledge.com/2011/12/12/big-memory-part-4/
>
> Ciao,
>
>                                         seba
>
>

Re: Performance of primitive collections

Posted by Robin Anil <ro...@gmail.com>.
Dawid, do you have your existing benchmark code against
fastutil/hppc/trove. Since the performance improvements we made in last
couple of months I am itching to revisit the numbers.

Robin

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
I know I'm shooting myself in the foot here, but fastutil is now ASL licensed and
it provides Java Collections compliance, so it should be a drop-in
replacement for former Colt collections... It is a larger library though
(but then, it provides more stuff ;).

Dawid

On Fri, Mar 4, 2011 at 3:38 PM, Benson Margulies <bi...@gmail.com> wrote:

> I am 100% in favor of replacing the colt-based thing in there now with
> something better if there is something better. I'll commit the patch,
> or help test, or whatever.
>
> I volunteered to put lipstick on the colt-based pig because I really
> wanted *some* non-GPL alternative to Trove at Apache, and the
> colt-based code sitting around Mahout was the shortest path.
>
>
> On Fri, Mar 4, 2011 at 6:11 AM, Dawid Weiss
> <da...@cs.put.poznan.pl> wrote:
> > A while ago we discussed if it'd make sense to integrate HPPC with Mahout, I
> > remember Ted mentioned it'd be good to have some kind of comparative
> > benchmark that would simply compare Mahout Math collections against HPPC. I
> > finally found the time to do it, driven by some strong updates Sebastiano
> > Vigna introduced in fastutil. But first, a disclaimer :)
> > I am not trying to impose HPPC on anybody and I'm not trying to show which
> > library is "better" or "worse" in a general sense. I believe the benchmarks
> > presented below and any conclusions that can be drawn from them apply only
> > in a marginal sense to any algorithm complex enough to be doing something
> > ELSE other than stressing a given hash map implementation. I also believe
> > that diversity drives progress, so it's good to share ;)
> >
> > So. First, I've been thinking about how to measure hash map performance in
> > any sensible (read: reasonable) application. Sebastiano suggested some clues
> > and I came up with some other ideas, but any realistic benchmarks (in the
> > sense of data or algorithm) would be a welcome extension. I've implemented
> > the following benchmarks (all on int-int maps):
> >
> > - BenchmarkPut: injecting a lot of integers in a map, basically what one
> > would do to get a unique set of keys, for example. The input ints have
> > different distributions: linear, random and high-bits linear (increments on
> > higher bits, then on lower bits).
> >
> > - BenchmarkGetWithRemoved: injecting a lot of integers, then removing a
> > fraction of them (not benchmarked), and (benchmarked) querying this map for
> > a partially matching set of keys. A realistic scenario if you have removals
> > from the map.
> >
> > - BenchmarkBigramCounting: calculating bigram frequencies from a real corpus
> > of texts in Polish. The input is 10MB long, so it's not a big data set, but
> > the texts are realistic and hence so are bigram distributions.
> >
> > I tested the following hash map implementations (int-int primitives):
> >
> > - fastutil (6.1.0),
> > - HPPC (trunk, but without improvements resulting from these benchmarks yet,
> > so I'm not biased ;)
> > - Java HashMap (on boxed Integer keys)
> > - GNU Trove (3.0.0rc1),
> > - Mahout Math/Collections (1.0)
> >
> > The bigram counting also includes PCJ and fastutil's linked hash map
> > implementation for historic reasons (but these are less relevant).
> >
> > The benchmarks were done on a Windows7 machine with I7 CPU, detailed log
> > attached, using Google Caliper. Times below are in ms. Links
> > should lead you to on-line charts. Some conclusions:
> >
> > 1) BenchmarkPut,
> > http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkPut
> >
> > distribution implementation    ms linear runtime
> >      RANDOM           HPPC  77.9 ==========
> >      LINEAR           HPPC  79.1 ==========
> >    HIGHBITS           HPPC  78.4 ==========
> >      RANDOM       FASTUTIL  97.4 ============
> >      LINEAR       FASTUTIL 101.2 =============
> >    HIGHBITS       FASTUTIL 101.3 =============
> >      RANDOM           JAVA 232.6 ==============================
> >      LINEAR           JAVA  72.7 =========
> >    HIGHBITS           JAVA 184.2 =======================
> >      RANDOM         MAHOUT 125.9 ================
> >      LINEAR         MAHOUT  60.3 =======
> >    HIGHBITS         MAHOUT  98.4 ============
> >      RANDOM          TROVE 113.8 ==============
> >      LINEAR          TROVE  43.8 =====
> >    HIGHBITS          TROVE  83.7 ==========
> >
> > I believe additional good hashing of integer keys does help to even-out any
> > weird input key distributions (and these would occur a lot in practical
> > situations, I bet). fastutil uses the finalization step from MurmurHash3,
> > HPPC uses a simplified full-pass over int4 bytes from MurmurHash2, but these
> > yield an equal effect (the speed difference is a result of another thing).
> > Mahout Math currently doesn't hash ints, so it'd be probably a good addition
> > (at very little overhead).
> >
> > 2) BenchmarkGetWithRemoved:
> > http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved
> >
> > %rem.keys   implementation     ms linear runtime
> >          0           HPPC  65.73 ============
> >          0       FASTUTIL  52.63 ==========
> >          0           JAVA 156.37 ==============================
> >          0          TROVE  75.31 ==============
> >          0         MAHOUT  85.59 ================
> >       0.25           HPPC  87.19 ================
> >       0.25       FASTUTIL  58.24 ===========
> >       0.25           JAVA 149.26 ============================
> >       0.25          TROVE 123.32 =======================
> >       0.25         MAHOUT 113.77 =====================
> >        0.5           HPPC 101.70 ===================
> >        0.5       FASTUTIL  55.74 ==========
> >        0.5           JAVA 135.01 =========================
> >        0.5          TROVE  93.19 =================
> >        0.5         MAHOUT 125.16 ========================
> >       0.75           HPPC  93.98 ==================
> >       0.75       FASTUTIL  42.20 ========
> >       0.75           JAVA 101.29 ===================
> >       0.75          TROVE  76.07 ==============
> >       0.75         MAHOUT  94.67 ==================
> >          1           HPPC  60.52 ===========
> >          1       FASTUTIL  10.78 ==
> >          1           JAVA  44.30 ========
> >          1          TROVE   9.54 =
> >          1         MAHOUT  30.98 =====
> >
> > This shows two things. The first thing is that open addressing based on
> > double hashing (with REMOVED state markers) requires either automatic or
> > manual compaction (rehashing) to avoid performance degradation. HPPC doesn't
> > do this because we rely on the user and so the time grows with the number of
> > removed keys:
> >
> >          0           HPPC  65.73 ============
> >       0.25           HPPC  87.19 ================
> >        0.5           HPPC 101.70 ===================
> >       0.75           HPPC  93.98 ==================
> >          1           HPPC  60.52 ===========
> >
> > (the drop at 100% of removed keys is probably HotSpot overlearning and
> > optimizing differently). It is curious as to why Mahout's implementation was
> > so much slower than anything else, but I didn't have the time to investigate
> > yet. The only thing that comes to my mind is using modulus % for sizing
> > instead of bitmasking (restricted 2^n map sizes). Also note that even
> > regular Java does a fairly good job here (the number of keys was 2,000,000,
> > but even for larger inputs the results were quite similar). Also, note
> > fastutil here:
> >
> >          0       FASTUTIL  52.63 ==========
> >       0.25       FASTUTIL  58.24 ===========
> >        0.5       FASTUTIL  55.74 ==========
> >       0.75       FASTUTIL  42.20 ========
> >          1       FASTUTIL  10.78 ==
> >
> > Nice. Sebastiano reimplemented hash maps to use linear addressing instead of
> > double hashing and with linear addressing you can avoid REMOVED state
> > markers entirely (so no need to rehash and you'll get empty spots for reuse
> > if adding new elements). The key here is the combination of linear
> > addressing AND a good key hash mangling function (theoretically, with a
> > perfect hashing function, the input key distribution wouldn't affect linear
> > addressing performance at all -- no clusters or chains effect).
> >
> > 3) Bigrams,
> > http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting
> >
> > This is a toy example, I know, but it's the only realistic one in the set,
> > so it's probably worth mentioning. It is a little bit puzzling to me why
> > HPPC is (by a small margin) better than fastutil because it really shouldn't
> > be. This isn't exactly noise (variance is given in plain text attached), but
> > I still consider this a glitch within the regular variance.
> >
> > You can try these on your own machine -- I would actually appreciate if you
> > sent your results and feedback back to me or this list so that we can see
> > how the above holds consistent between architectures, processors and VMs.
> >
> > git clone git@github.com:carrotsearch/hppc.git
> > cd hppc-others
> > (cd lib/; bash ./install-deps )
> > mvn clean package
> > java -server -jar hppc-others-0.4.0-SNAPSHOT/hppc-others*.jar | tee
> > benchmark.log
> >
> > I didn't check the performance of simple lists or other structures with
> > minimal overhead because I don't imagine them being significantly faster or
> > slower in any library. For these, other factors play a role (I personally like
> > the fact that I can access the internal buffers in HPPC, but it may again be a
> > point against using it ;).
> >
> > Dawid
> >
>
>

Re: Performance of primitive collections

Posted by Benson Margulies <bi...@gmail.com>.
I am 100% in favor of replacing the colt-based thing in there now with
something better if there is something better. I'll commit the patch,
or help test, or whatever.

I volunteered to put lipstick on the colt-based pig because I really
wanted *some* non-GPL alternative to Trove at Apache, and the
colt-based code sitting around Mahout was the shortest path.


On Fri, Mar 4, 2011 at 6:11 AM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
> A while ago we discussed if it'd make sense to integrate HPPC with Mahout, I
> remember Ted mentioned it'd be good to have some kind of comparative
> benchmark that would simply compare Mahout Math collections against HPPC. I
> finally found the time to do it, driven by some strong updates Sebastiano
> Vigna introduced in fastutil. But first, a disclaimer :)
>
> I am not trying to impose HPPC on anybody and I'm not trying to show which
> library is "better" or "worse" in a general sense. I believe the benchmarks
> presented below and any conclusions that can be drawn from them apply only
> in a marginal sense to any algorithm complex enough to be doing something
> ELSE other than stressing a given hash map implementation. I also believe
> that diversity drives progress, so it's good to share ;)
>
> So. First, I've been thinking about how to measure hash map performance in
> any sensible (read: reasonable) application. Sebastiano suggested some clues
> and I came up with some other ideas, but any realistic benchmarks (in the
> sense of data or algorithm) would be a welcome extension. I've implemented
> the following benchmarks (all on int-int maps):
>
> - BenchmarkPut: injecting a lot of integers in a map, basically what one
> would do to get a unique set of keys, for example. The input ints have
> different distributions: linear, random and high-bits linear (increments on
> higher bits, then on lower bits).
>
> - BenchmarkGetWithRemoved: injecting a lot of integers, then removing a
> fraction of them (not benchmarked), and (benchmarked) querying this map for
> a partially matching set of keys. A realistic scenario if you have removals
> from the map.
>
> - BenchmarkBigramCounting: calculating bigram frequencies from a real corpus
> of texts in Polish. The input is 10MB long, so it's not a big data set, but
> the texts are realistic and hence so are bigram distributions.
>
> I tested the following hash map implementations (int-int primitives):
>
> - fastutil (6.1.0),
> - HPPC (trunk, but without improvements resulting from these benchmarks yet,
> so I'm not biased ;)
> - Java HashMap (on boxed Integer keys)
> - GNU Trove (3.0.0rc1),
> - Mahout Math/Collections (1.0)
>
> The bigram counting also includes PCJ and fastutil's linked hash map
> implementation for historic reasons (but these are less relevant).
>
> The benchmarks were done on a Windows7 machine with I7 CPU, detailed log
> attached, using Google Caliper. Times below are in ms. Links
> should lead you to on-line charts. Some conclusions:
>
> 1) BenchmarkPut,
> http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkPut
>
> distribution implementation    ms linear runtime
>      RANDOM           HPPC  77.9 ==========
>      LINEAR           HPPC  79.1 ==========
>    HIGHBITS           HPPC  78.4 ==========
>      RANDOM       FASTUTIL  97.4 ============
>      LINEAR       FASTUTIL 101.2 =============
>    HIGHBITS       FASTUTIL 101.3 =============
>      RANDOM           JAVA 232.6 ==============================
>      LINEAR           JAVA  72.7 =========
>    HIGHBITS           JAVA 184.2 =======================
>      RANDOM         MAHOUT 125.9 ================
>      LINEAR         MAHOUT  60.3 =======
>    HIGHBITS         MAHOUT  98.4 ============
>      RANDOM          TROVE 113.8 ==============
>      LINEAR          TROVE  43.8 =====
>    HIGHBITS          TROVE  83.7 ==========
>
> I believe additional good hashing of integer keys does help to even-out any
> weird input key distributions (and these would occur a lot in practical
> situations, I bet). fastutil uses the finalization step from MurmurHash3,
> HPPC uses a simplified full-pass over int4 bytes from MurmurHash2, but these
> yield an equal effect (the speed difference is a result of another thing).
> Mahout Math currently doesn't hash ints, so it'd be probably a good addition
> (at very little overhead).
>
> 2) BenchmarkGetWithRemoved:
> http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved
>
> %rem.keys   implementation     ms linear runtime
>          0           HPPC  65.73 ============
>          0       FASTUTIL  52.63 ==========
>          0           JAVA 156.37 ==============================
>          0          TROVE  75.31 ==============
>          0         MAHOUT  85.59 ================
>       0.25           HPPC  87.19 ================
>       0.25       FASTUTIL  58.24 ===========
>       0.25           JAVA 149.26 ============================
>       0.25          TROVE 123.32 =======================
>       0.25         MAHOUT 113.77 =====================
>        0.5           HPPC 101.70 ===================
>        0.5       FASTUTIL  55.74 ==========
>        0.5           JAVA 135.01 =========================
>        0.5          TROVE  93.19 =================
>        0.5         MAHOUT 125.16 ========================
>       0.75           HPPC  93.98 ==================
>       0.75       FASTUTIL  42.20 ========
>       0.75           JAVA 101.29 ===================
>       0.75          TROVE  76.07 ==============
>       0.75         MAHOUT  94.67 ==================
>          1           HPPC  60.52 ===========
>          1       FASTUTIL  10.78 ==
>          1           JAVA  44.30 ========
>          1          TROVE   9.54 =
>          1         MAHOUT  30.98 =====
>
> This shows two things. The first thing is that open addressing based on
> double hashing (with REMOVED state markers) requires either automatic or
> manual compaction (rehashing) to avoid performance degradation. HPPC doesn't
> do this because we rely on the user and so the time grows with the number of
> removed keys:
>
>          0           HPPC  65.73 ============
>       0.25           HPPC  87.19 ================
>        0.5           HPPC 101.70 ===================
>       0.75           HPPC  93.98 ==================
>          1           HPPC  60.52 ===========
>
> (the drop at 100% of removed keys is probably HotSpot overlearning and
> optimizing differently). It is curious as to why Mahout's implementation was
> so much slower than anything else, but I didn't have the time to investigate
> yet. The only thing that comes to my mind is using modulus % for sizing
> instead of bitmasking (restricted 2^n map sizes). Also note that even
> regular Java does a fairly good job here (the number of keys was 2,000,000,
> but even for larger inputs the results were quite similar). Also, note
> fastutil here:
>
>          0       FASTUTIL  52.63 ==========
>       0.25       FASTUTIL  58.24 ===========
>        0.5       FASTUTIL  55.74 ==========
>       0.75       FASTUTIL  42.20 ========
>          1       FASTUTIL  10.78 ==
>
> Nice. Sebastiano reimplemented hash maps to use linear addressing instead of
> double hashing and with linear addressing you can avoid REMOVED state
> markers entirely (so no need to rehash and you'll get empty spots for reuse
> if adding new elements). The key here is the combination of linear
> addressing AND a good key hash mangling function (theoretically, with a
> perfect hashing function, the input key distribution wouldn't affect linear
> addressing performance at all -- no clusters or chains effect).
>
> 3) Bigrams,
> http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting
>
> This is a toy example, I know, but it's the only realistic one in the set,
> so it's probably worth mentioning. It is a little bit puzzling to me why
> HPPC is (by a small margin) better than fastutil because it really shouldn't
> be. This isn't exactly noise (variance is given in plain text attached), but
> I still consider this a glitch within the regular variance.
>
> You can try these on your own machine -- I would actually appreciate if you
> sent your results and feedback back to me or this list so that we can see
> how the above holds consistent between architectures, processors and VMs.
>
> git clone git@github.com:carrotsearch/hppc.git
> cd hppc-others
> (cd lib/; bash ./install-deps )
> mvn clean package
> java -server -jar hppc-others-0.4.0-SNAPSHOT/hppc-others*.jar | tee
> benchmark.log
>
> I didn't check the performance of simple lists or other structures with
> minimal overhead because I don't imagine them being significantly faster or
> slower in any library. For these, other factors play a role (I personally like
> the fact that I can access the internal buffers in HPPC, but it may again be a
> point against using it ;).
>
> Dawid
>

Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
A while ago we discussed if it'd make sense to integrate HPPC with Mahout, I
remember Ted mentioned it'd be good to have some kind of comparative
benchmark that would simply compare Mahout Math collections against HPPC. I
finally found the time to do it, driven by some strong updates Sebastiano
Vigna introduced in fastutil. But first, a disclaimer :)

I am not trying to impose HPPC on anybody and I'm not trying to show which
library is "better" or "worse" in a general sense. I believe the benchmarks
presented below and any conclusions that can be drawn from them apply only
in a marginal sense to any algorithm complex enough to be doing something
ELSE other than stressing a given hash map implementation. I also believe
that diversity drives progress, so it's good to share ;)

So. First, I've been thinking about how to measure hash map performance in
any sensible (read: reasonable) application. Sebastiano suggested some clues
and I came up with some other ideas, but any realistic benchmarks (in the
sense of data or algorithm) would be a welcome extension. I've implemented
the following benchmarks (all on int-int maps):

- BenchmarkPut: injecting a lot of integers in a map, basically what one
would do to get a unique set of keys, for example. The input ints have
different distributions: linear, random and high-bits linear (increments on
higher bits, then on lower bits).

- BenchmarkGetWithRemoved: injecting a lot of integers, then removing a
fraction of them (not benchmarked), and (benchmarked) querying this map for
a partially matching set of keys. A realistic scenario if you have removals
from the map.

- BenchmarkBigramCounting: calculating bigram frequencies from a real corpus
of texts in Polish. The input is 10MB long, so it's not a big data set, but
the texts are realistic and hence so are bigram distributions.
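
To make the last workload concrete, here is a minimal, hypothetical sketch
of what bigram counting over an int-int map looks like. This is my own
illustration, not the actual benchmark code (which lives in hppc-others);
the BigramSketch class and the char-packing scheme are assumptions, and a
boxed java.util.HashMap is used only to keep the sketch dependency-free --
the benchmark replaces it with primitive int-int maps:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the bigram-counting workload: each pair of
// adjacent chars is packed into one int key, which is what makes
// int-int primitive maps a natural fit for this benchmark.
final class BigramSketch {
    static Map<Integer, Integer> bigramCounts(String text) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            // Pack two 16-bit chars into a single 32-bit key.
            int key = (text.charAt(i) << 16) | text.charAt(i + 1);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```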

I tested the following hash map implementations (int-int primitives):

- fastutil (6.1.0),
- HPPC (trunk, but without improvements resulting from these benchmarks yet,
so I'm not biased ;)
- Java HashMap (on boxed Integer keys)
- GNU Trove (3.0.0rc1),
- Mahout Math/Collections (1.0)

The bigram counting also includes PCJ and fastutil's linked hash map
implementation for historic reasons (but these are less relevant).

The benchmarks were done on a Windows 7 machine with an i7 CPU, detailed log
attached, using Google Caliper. Times below are in ms. Links should lead you
to on-line charts. Some conclusions:

1) BenchmarkPut,
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkPut

distribution implementation    ms linear runtime
      RANDOM           HPPC  77.9 ==========
      LINEAR           HPPC  79.1 ==========
    HIGHBITS           HPPC  78.4 ==========
      RANDOM       FASTUTIL  97.4 ============
      LINEAR       FASTUTIL 101.2 =============
    HIGHBITS       FASTUTIL 101.3 =============
      RANDOM           JAVA 232.6 ==============================
      LINEAR           JAVA  72.7 =========
    HIGHBITS           JAVA 184.2 =======================
      RANDOM         MAHOUT 125.9 ================
      LINEAR         MAHOUT  60.3 =======
    HIGHBITS         MAHOUT  98.4 ============
      RANDOM          TROVE 113.8 ==============
      LINEAR          TROVE  43.8 =====
    HIGHBITS          TROVE  83.7 ==========

I believe additional good hashing of integer keys does help to even out any
weird input key distributions (and these would occur a lot in practical
situations, I bet). fastutil uses the finalization step from MurmurHash3;
HPPC uses a simplified full pass over the int's 4 bytes from MurmurHash2,
but these yield a comparable effect (the speed difference is a result of
something else). Mahout Math currently doesn't hash ints, so it'd probably
be a good addition (at very little overhead).
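
For reference, the MurmurHash3 finalization step is tiny. Below is a sketch
of how such a mixing function is typically applied before masking the key
into a table slot; the constants and shifts are MurmurHash3's published
fmix32 values, but the Fmix32 class and slot() helper are my illustration,
not any library's actual API:

```java
// MurmurHash3's 32-bit finalization step ("fmix32"): an avalanche
// function that spreads every input bit across all output bits, so
// highly regular key distributions (linear, high-bits) still land
// roughly uniformly in the table.
final class Fmix32 {
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Slot selection for a power-of-two table: mask = tableSize - 1.
    static int slot(int key, int mask) {
        return mix(key) & mask;
    }
}
```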

2) BenchmarkGetWithRemoved:
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkGetWithRemoved

%rem.keys   implementation     ms linear runtime
          0           HPPC  65.73 ============
          0       FASTUTIL  52.63 ==========
          0           JAVA 156.37 ==============================
          0          TROVE  75.31 ==============
          0         MAHOUT  85.59 ================
       0.25           HPPC  87.19 ================
       0.25       FASTUTIL  58.24 ===========
       0.25           JAVA 149.26 ============================
       0.25          TROVE 123.32 =======================
       0.25         MAHOUT 113.77 =====================
        0.5           HPPC 101.70 ===================
        0.5       FASTUTIL  55.74 ==========
        0.5           JAVA 135.01 =========================
        0.5          TROVE  93.19 =================
        0.5         MAHOUT 125.16 ========================
       0.75           HPPC  93.98 ==================
       0.75       FASTUTIL  42.20 ========
       0.75           JAVA 101.29 ===================
       0.75          TROVE  76.07 ==============
       0.75         MAHOUT  94.67 ==================
          1           HPPC  60.52 ===========
          1       FASTUTIL  10.78 ==
          1           JAVA  44.30 ========
          1          TROVE   9.54 =
          1         MAHOUT  30.98 =====

This shows two things. The first thing is that open addressing based on
double hashing (with REMOVED state markers) requires either automatic or
manual compaction (rehashing) to avoid performance degradation. HPPC doesn't
do this because we rely on the user and so the time grows with the number of
removed keys:

          0           HPPC  65.73 ============
       0.25           HPPC  87.19 ================
        0.5           HPPC 101.70 ===================
       0.75           HPPC  93.98 ==================
          1           HPPC  60.52 ===========

(the drop at 100% of removed keys is probably HotSpot overlearning and
optimizing differently). It is curious why Mahout's implementation was
so much slower than anything else, but I didn't have the time to investigate
yet. The only thing that comes to my mind is its use of modulus (%) for slot
computation instead of bitmasking (which restricts map sizes to 2^n). Also
note that even regular Java does a fairly good job here (the number of keys
was 2,000,000, but even for larger inputs the results were quite similar).
Also, note fastutil here:

          0       FASTUTIL  52.63 ==========
       0.25       FASTUTIL  58.24 ===========
        0.5       FASTUTIL  55.74 ==========
       0.75       FASTUTIL  42.20 ========
          1       FASTUTIL  10.78 ==

Nice. Sebastiano reimplemented hash maps to use linear addressing instead of
double hashing, and with linear addressing you can avoid REMOVED state
markers entirely (so there is no need to rehash, and you get empty spots for
reuse when adding new elements). The key here is the combination of linear
addressing AND a good key hash mangling function (theoretically, with a
perfect hashing function, the input key distribution wouldn't affect linear
addressing performance at all -- no clustering or chaining effects).
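
Here is a minimal toy sketch of the idea -- my own illustration, NOT
fastutil's code. The sentinel EMPTY key, the fixed non-resizing table, and
the class/method names are all simplifying assumptions. On removal, later
entries of the same probe chain are shifted back into the hole
("backward-shift deletion"), so no REMOVED marker is ever written:

```java
import java.util.Arrays;

// Toy int->int open-addressing map with linear probing and
// backward-shift deletion (no REMOVED/tombstone markers).
// Sketch only: no resizing, assumes the table never fills up,
// and Integer.MIN_VALUE cannot be used as a key (sentinel).
final class LinearProbingIntMap {
    private static final int EMPTY = Integer.MIN_VALUE;
    private final int[] keys, values;
    private final int mask;

    LinearProbingIntMap(int capacityPow2) {
        keys = new int[capacityPow2];
        values = new int[capacityPow2];
        Arrays.fill(keys, EMPTY);
        mask = capacityPow2 - 1;
    }

    // MurmurHash3 finalization step as the key mangling function.
    private static int mix(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    void put(int key, int value) {
        int i = mix(key) & mask;
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & mask;
        keys[i] = key;
        values[i] = value;
    }

    int get(int key, int defaultValue) {
        int i = mix(key) & mask;
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return defaultValue;
    }

    void remove(int key) {
        int i = mix(key) & mask;
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & mask;
        if (keys[i] == EMPTY) return;
        // Backward-shift deletion: close the hole at 'last' by moving
        // later entries of the same probe chain into it.
        int last = i;
        for (int j = (i + 1) & mask; keys[j] != EMPTY; j = (j + 1) & mask) {
            int slot = mix(keys[j]) & mask;
            // Move keys[j] iff the hole lies cyclically in [slot, j),
            // i.e. leaving the hole would break j's probe chain.
            if (last <= j ? last >= slot || slot > j
                          : last >= slot && slot > j) {
                keys[last] = keys[j];
                values[last] = values[j];
                last = j;
            }
        }
        keys[last] = EMPTY;
    }
}
```

After every remove() the table looks as if the deleted key had never been
inserted, which is why no compaction step is needed.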

3) Bigrams,
http://dawidweiss123.appspot.com/run/dawid.weiss@gmail.com/com.carrotsearch.hppc.caliper.BenchmarkBigramCounting

This is a toy example, I know, but it's the only realistic one in the set,
so it's probably worth mentioning. It is a little bit puzzling to me why
HPPC is (by a small margin) better than fastutil because it really shouldn't
be. This isn't exactly noise (variance is given in plain text attached), but
I still consider this a glitch within the regular variance.

You can try these on your own machine -- I would actually appreciate it if
you sent your results and feedback back to me or this list so that we can
see how the above holds up consistently between architectures, processors
and VMs.

git clone git@github.com:carrotsearch/hppc.git
cd hppc-others
(cd lib/; bash ./install-deps )
mvn clean package
java -server -jar hppc-others-0.4.0-SNAPSHOT/hppc-others*.jar | tee benchmark.log

I didn't check the performance of simple lists or other structures with
minimal overhead because I don't imagine them being significantly faster or
slower in any library. For those, other factors play a role (I personally
like the fact that I can access the internal buffers in HPPC, but it may
also be a point against using it ;).

Dawid

Re: Performance of primitive collections

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> git clone git@github.com:carrotsearch/hppc.git
> Cloning into hppc...
> Permission denied (publickey).
>

Oops, sorry -- the address should be the one for read-only clones:
git clone git://github.com/carrotsearch/hppc.git

You can also browse on-line here:
https://github.com/carrotsearch/hppc

Dawid