You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Mikkel Fahnøe Jørgensen <mi...@yahoo.dk> on 2002/08/17 13:29:15 UTC

Fwd: hash in subversions vdelta.c

Forward, I originally sent this comment to Brane who
asked to forward to this list. I'm not on the list.
The comment relates to a discussion in the forum late
2000 I found searching on vdelta. (FYI: I just
discovered zdelta http://cis.poly.edu/zdelta/ , but it
is more complex than vdiff).

Regards Mikkel

 --- Mikkel Fahnøe Jørgensen <mi...@yahoo.dk>
skrev: > Dato: Sat, 17 Aug 2002 12:42:33 +0200 (CEST)
> Fra: Mikkel Fahnøe Jørgensen <mi...@yahoo.dk>
> Emne: hash in subversions vdelta.c
> Til: brane@xbc.nu
> 
> Regarding subversion, vdelta source
> 
> I saw your discussion on hash functions in vdelta.c
> (see snippet below)
> 
> I assume the h = h * 127 + *k++ is the faster
> version
> (it's also in the current source).
> The C-compiler optimizes this hash to h = (h << 8) -
> 1
> + *k++, whereas the 97 version probably needs a
> multiplication. For a tight loop this may make a
> difference. Since the hash loops 4 times, the
> compiler
> may even unroll the loop. Depending on processor and
> byte alignment issues, this may even reduce to
> loading
> the 4 bytes into memory in one operation and
> subtracting a constant from the result. This is
> orders
> of magnitudes faster than looping 4 times doing
> multiplications with 97 and adding a constant each
> time.
> 
> I have also attached my 32bit version of Bob Jenkins
> hash function. It is slower but generates fewer
> conflicts (not important here as the conflicts
> mostly
> come from looking only at a prefix). It does not
> require that hashtable to be a multiple of a prime
> (you can just take modulo a power of two).
> 
> Regards Mikkel
> 
> ---
>     [brane@lux tests]$ perl stat.pl < stats-1    # h
> =
> h * 127 + *k++
>     Average load:         51.21
>     Average collisions:   59728.10
>     Corrected collisions: 1158.74
>     [brane@lux tests]$ perl stat.pl < stats-2    # h
> =
> h * 97 + *k++ + 41
>     Average load:         51.22
>     Average collisions:   59716.55
>     Corrected collisions: 1157.68
> 
> Now if someone will volunteer to add some /real/
> analysis code
> (average chain length, deviation, etc., etc.), I'd
> be
> interested
> to see the results. :-)
> ---
> 
> 
> /*
>    Based on Bob Jenkins lookup2.c file, hash2
> function,
>    modified to specifically handle 32bit keys
>    i.e. hash2 with length=1, initval=0
>    note that according to Jenkins, the hashtable
> should be a power
>    of two with excessive bits masked. That is, no
> prime number required.
> 
>    http://burtleburtle.net/bob/c/lookup2.c
>    http://burtleburtle.net/bob/hash/hashfaq.html
> */
> typedef unsigned long ub4; /* unsigned 4 byte value
> (32 bit) */
> inline ub4 HashId(ub4 k)
> {
>     ub4 a,b,c;
> 
>     /* Set up the internal state */
> 
>     a = b = 0x9e3779b9;  /* the golden ratio; an
> arbitrary value */
>     c = 1;
>     a += k;
> 
>     /* This is Bob Jenkins mix(a,b,c) macro embedded
> */
>     a -= b; a -= c; a ^= (c>>13);
>     b -= c; b -= a; b ^= (a<<8);
>     c -= a; c -= b; c ^= (b>>13);
>     a -= b; a -= c; a ^= (c>>12);
>     b -= c; b -= a; b ^= (a<<16);
>     c -= a; c -= b; c ^= (b>>5);
>     a -= b; a -= c; a ^= (c>>3);
>     b -= c; b -= a; b ^= (a<<10);
>     c -= a; c -= b; c ^= (b>>15);
> 
>     /*--------------------------------------------
> report the result */
>     return c;
> }
> 
> 
> 
> Få den nye Yahoo! Messenger på
> www.yahoo.dk/messenger
> Nu med webkamera, talechat, interaktive baggrunde og
> meget mere!
>  

Få den nye Yahoo! Messenger på www.yahoo.dk/messenger
Nu med webkamera, talechat, interaktive baggrunde og meget mere!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Fwd: hash in subversions vdelta.c

Posted by Mikkel Fahnøe Jørgensen <mi...@yahoo.dk>.

 --- Branko Äibej <br...@xbc.nu> skrev: > Mikkel
FahnÃ¸e JÃ¸rgensen wrote:

> Well, the hash table in vdelta.c is a bit special in
> that it expects 
> conflicts, because it can hold duplicate keys. In
> fact, without that 
> feature, the VDelta algorithm wouldn't work at all.
> :-)

There is nothing special about that. Any open hash
uses a link to deal with conflicts. VDelta is special
because some conflicts are intentional - these all
come from using the same key. You are just
implementing a multimap. You still want to reduce the
number of conflicts from the fingerprint hash when the
input keys are not identical. This is accomplished by
using a good hash function. In your source you could
choose the table size to be module of a prime and
you'd get less undesired conflicts. This approach
makes it difficult to grow the hash - which is not a
problem with the mix hash because it does not require
a prime. Apparently you do not grow the table at all -
this will break the hashtable performance above a
certain level - unless you use windows.

> >If VDelta is using a 64K window (or whatever) you
> get
> >an efficient upper limit on the hashtable which
> >affects the choice of indexing method. In
> particular
> >you can oversize a hash table to reduce conflicts
> and
> >you do not need to grow the hash nor risc conflicts
> >with virtual memory.
> >  
> >
> Take a look at the hash table implementation, and
> the commentary at the 
> top of vdelta.c that describes it. It's not a
> general-purpose table, as 
> you seem to expect.

See aslo above.

VDelta can work with windowed buffers even if your
implementation does not. If you look at the ZDelta
implementation that is largely based on VDelta, you'll
find it uses a window. This both avoids degrading hash
performance and allows processing large files
buffered.

http://cis.poly.edu/zdelta/
http://cis.poly.edu/tr/tr-cis-2002-02.shtml

I have attached my IdHashTable.h. It does handle
multiple keys and it actually also maintains an
ordered collection - but that feature is not
particularly relevant to vdelta.
These two functions iterate over exactly the conflicts
that are of interest to vdelta and skips spurious
conflicts:

    long GetIndexOf(long id)
    long GetNextIndexOf(long id, long index)

For performance reasons you'd specialize this code for
vdelta lookup, but the hash is well suited for the
purpose.

Mikkel

Få den nye Yahoo! Messenger på www.yahoo.dk/messenger
Nu med webkamera, talechat, interaktive baggrunde og meget mere!

Re: Fwd: hash in subversions vdelta.c

Posted by Branko Čibej <br...@xbc.nu>.

Mikkel Fahnøe Jørgensen wrote:

>>BTW, did you do any performance comparisons between
>>the current code and 
>>the mix hash? I'd be very interested in seeing some
>>numbers. (Our 
>>randm-test.c could probably be used for that.)
>>    
>>
>
>No. A long time ago I replaced a hash not unlike yours
>with the hash in the Compiler Dragon book for use in a
>symbol table. It dramatically improved performance due
>to reduced conflicts. Yet the classic Dragon book hash
>is not considered to be very efficient at avoiding
>conflicts. This is for variable length strings and
>your case is for 32bit values, so it cannot be
>compared directly. It is cheap to test conflicts for
>example.
>
Well, the hash table in vdelta.c is a bit special in that it expects 
conflicts, because it can hold duplicate keys. In fact, without that 
feature, the VDelta algorithm wouldn't work at all. :-)

>If you are doing a modulo prime on the hash result,
>
We aren't doung a modulo prime. We're doing modulo the table size to 
convert the hash value into an index, that's all.

>the mix hash could well turn out to be faster because
>it only requires an And operation to clamp the hash
>key into range. Otherwise it clearly has more
>instructions than unrolling 4 shift/add. My best guess
>is that the mix hash will at the latest perform better
>when conflicts become significant. However, depending
>on the avarage shared run in the data, the fact that
>only 4 chars are being hashed may generate the bulk of
>conflicts, independently of the hash function in use.
>
>If VDelta is using a 64K window (or whatever) you get
>an efficient upper limit on the hashtable which
>affects the choice of indexing method. In particular
>you can oversize a hash table to reduce conflicts and
>you do not need to grow the hash nor risc conflicts
>with virtual memory.
>  
>
Take a look at the hash table implementation, and the commentary at the 
top of vdelta.c that describes it. It's not a general-purpose table, as 
you seem to expect.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Fwd: hash in subversions vdelta.c

Posted by Mikkel Fahnøe Jørgensen <mi...@yahoo.dk>.

  --- Branko Äibej <br...@xbc.nu> skrev: > Mikkel
FahnÃ¸e JÃ¸rgensen wrote:

> >>The C-compiler optimizes this hash to h = (h << 8)
> -
> >>1
> >>+ *k++,
> >>
> 
> Actually, this should be "h = (h << 7) - h + *k++".
> It would be faster 
> to multiply by 129, then it would become "h += (h <<
> 7) + *k++".

Yes, I realized this shortly after posting the mail
but recognized you'd figure it out. It also has the
benefit of not overflowing the result during shift the
last shift. It does invalidate my argument that it can
trivially load 4 bytes at once.

> 
> >> whereas the 97 version probably needs a
> >>multiplication.
> >>
> Interestingly, the multiplication that converts to a
> shift-plus-add can 
> be faster even without optimization, because
> (presumably) executing the 
> instruction itself takes less time.

That was the idea.

> "Orders of magniture" may be right for the hash
> function itself, but 
> that's just a small part of the total calculation.

But apparently a significant part. The speed
significance of changing hash function is interesting.

> Interesting. I'll definitely look into this if
> hashing turns out to be a 
> bottleneck. For now, though I think clarity is more
> important.

Certainly.

> >> It does not
> >>require that hashtable to be a multiple of a prime
> >>(you can just take modulo a power of two).
> >>
> No good hash function requires a prime modulo. Most
> 2-universal 
> multiplicative hash functions (i.e., those with an
> odd multiplier -- 
> which is what we're using now) definitely don't.

This is a different kind of hash. I do not claim to be
an expert on this topic, I only cite what Bob Jenkins
claims. He designed the hash to not require taking
modulo subsequently. He did not argue that it wouldn't
be better taking modulo though - but it was a good
tradeoff that would provide a fast convenient and
efficient hash. I suspect the prime nature is somehow
coded into the hash - he did a brute force search
through a large space to find the hash function
meeting his criteria. You should also weigh the cost
of modulo prime against the cost of slightly increased
level of conflicts.

BTW the version I posted was optimized for 32bit
words. But since the input of VDelta is not aligned,
the byte aligned variable length hash may be a better
starting point. (See Jenkins homepage
http://burtleburtle.net/bob/hash/doobs.html).

> BTW, did you do any performance comparisons between
> the current code and 
> the mix hash? I'd be very interested in seeing some
> numbers. (Our 
> randm-test.c could probably be used for that.)

No. A long time ago I replaced a hash not unlike yours
with the hash in the Compiler Dragon book for use in a
symbol table. It dramatically improved performance due
to reduced conflicts. Yet the classic Dragon book hash
is not considered to be very efficient at avoiding
conflicts. This is for variable length strings and
your case is for 32bit values, so it cannot be
compared directly. It is cheap to test conflicts for
example.

If you are doing a modulo prime on the hash result,
the mix hash could well turn out to be faster because
it only requires an And operation to clamp the hash
key into range. Otherwise it clearly has more
instructions than unrolling 4 shift/add. My best guess
is that the mix hash will at the latest perform better
when conflicts become significant. However, depending
on the avarage shared run in the data, the fact that
only 4 chars are being hashed may generate the bulk of
conflicts, independently of the hash function in use.

If VDelta is using a 64K window (or whatever) you get
an efficient upper limit on the hashtable which
affects the choice of indexing method. In particular
you can oversize a hash table to reduce conflicts and
you do not need to grow the hash nor risc conflicts
with virtual memory.

Mikkel

Få den nye Yahoo! Messenger på www.yahoo.dk/messenger
Nu med webkamera, talechat, interaktive baggrunde og meget mere!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Fwd: hash in subversions vdelta.c

Posted by Branko Čibej <br...@xbc.nu>.

Mikkel Fahnøe Jørgensen wrote:

>Forward, I originally sent this comment to Brane who
>asked to forward to this list. I'm not on the list.
>The comment relates to a discussion in the forum late
>2000 I found searching on vdelta. (FYI: I just
>discovered zdelta http://cis.poly.edu/zdelta/ , but it
>is more complex than vdiff).
>
>Regards Mikkel
>
> --- Mikkel Fahnøe Jørgensen <mi...@yahoo.dk>
>skrev: > Dato: Sat, 17 Aug 2002 12:42:33 +0200 (CEST)
>  
>
>>Fra: Mikkel Fahnøe Jørgensen <mi...@yahoo.dk>
>>Emne: hash in subversions vdelta.c
>>Til: brane@xbc.nu
>>
>>Regarding subversion, vdelta source
>>
>>I saw your discussion on hash functions in vdelta.c
>>(see snippet below)
>>
>>I assume the h = h * 127 + *k++ is the faster
>>version
>>(it's also in the current source).
>>The C-compiler optimizes this hash to h = (h << 8) -
>>1
>>+ *k++,
>>

Actually, this should be "h = (h << 7) - h + *k++". It would be faster 
to multiply by 129, then it would become "h += (h << 7) + *k++".

>> whereas the 97 version probably needs a
>>multiplication.
>>
Interestingly, the multiplication that converts to a shift-plus-add can 
be faster even without optimization, because (presumably) executing the 
instruction itself takes less time.

>> For a tight loop this may make a
>>difference. Since the hash loops 4 times, the
>>compiler
>>may even unroll the loop.
>>
I fully expect the compiler to unroll this.

>> Depending on processor and
>>byte alignment issues, this may even reduce to
>>loading
>>the 4 bytes into memory in one operation and
>>subtracting a constant from the result. This is
>>orders
>>of magnitudes faster than looping 4 times doing
>>multiplications with 97 and adding a constant each
>>time.
>>
"Orders of magniture" may be right for the hash function itself, but 
that's just a small part of the total calculation.

>>I have also attached my 32bit version of Bob Jenkins
>>hash function. It is slower but generates fewer
>>conflicts (not important here as the conflicts
>>mostly come from looking only at a prefix).
>>
Interesting. I'll definitely look into this if hashing turns out to be a 
bottleneck. For now, though I think clarity is more important.

>> It does not
>>require that hashtable to be a multiple of a prime
>>(you can just take modulo a power of two).
>>
No good hash function requires a prime modulo. Most 2-universal 
multiplicative hash functions (i.e., those with an odd multiplier -- 
which is what we're using now) definitely don't.

[snip]


BTW, did you do any performance comparisons between the current code and 
the mix hash? I'd be very interested in seeing some numbers. (Our 
randm-test.c could probably be used for that.)

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org