You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/04/21 02:05:24 UTC

Re: a note on multiword hashed tokens and collisions (fwd)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Daniel Quinlan writes:
> Sidney Markowitz <si...@sidney.com> writes:
> 
> > But I think that my numbers show that 40-bit should be ok at 4 million 
> > and certainly at 3 million.
> 
> I think it's perfectly acceptable to have a small number of collisions
> and that size savings is a much more important factor.

yeah.  Don't forget, I was forwarding this as a datapoint for *multi*-word
token use, which produces way more tokens.

> Bear in mind, not only does a collision have to happen, but it has to
> change the actual result *in the wrong direction* before we actually
> start caring.

The bogofilter/CRM-114 forward was pretty clear that collisions in
multiword token use caused FPs: 'the hash collisions quickly caused
outrageously bad classification mistakes'.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAhbrEQTcbUG5Y7woRAgmtAJwMENiwuj0p5CJnapkFwmmt53DzCQCg6dnk
fqN7PN4MnnUs4rjVk1ZOJcE=
=TkyY
-----END PGP SIGNATURE-----


Re: a note on multiword hashed tokens and collisions (fwd)

Posted by Sidney Markowitz <si...@sidney.com>.
Sidney Markowitz wrote:
> Yes, but they are using 4 bits less in the hash function

Typo. 40 - 32 = 8 bits, not four :-)


Re: a note on multiword hashed tokens and collisions (fwd)

Posted by Sidney Markowitz <si...@sidney.com>.
Justin Mason wrote:
> The bogofilter/CRM-114 forward was pretty clear that collisions in
> multiword token use caused FPs: 'the hash collisions quickly caused
> outrageously bad classification mistakes'.

Yes, but they are using 4 bits less in the hash function and multiplying 
their number of tokens by 16 by generating multiword tokens. And there 
is an exponential effect. My numbers show us getting something like 16 
collisions (32 tokens) out of 4 million using 40 bits. The same 
calculations show on the order of two million colliding tokens when you 
use a 32 bit hash on a multiword database generated from a 4 million 
single token base. No wonder they have a problem. They are ignoring the 
proper use of hash functions.

  -- sidney