You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/04/21 02:05:24 UTC
Re: a note on multiword hashed tokens and collisions (fwd)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Daniel Quinlan writes:
> Sidney Markowitz <si...@sidney.com> writes:
>
> > But I think that my numbers show that 40-bit should be ok at 4 million
> > and certainly at 3 million.
>
> I think it's perfectly acceptable to have a small number of collisions
> and that size savings is a much more important factor.
yeah. Don't forget, I was forwarding this as a datapoint for *multi*-word
token use, which produces way more tokens.
> Bear in mind, not only does a collision have to happen, but it has to
> change the actual result *in the wrong direction* before we actually
> start caring.
The bogofilter/CRM-114 forward was pretty clear that collisions in
multiword token use caused FPs: 'the hash collisions quickly caused
outrageously bad classification mistakes'.
- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFAhbrEQTcbUG5Y7woRAgmtAJwMENiwuj0p5CJnapkFwmmt53DzCQCg6dnk
fqN7PN4MnnUs4rjVk1ZOJcE=
=TkyY
-----END PGP SIGNATURE-----
Re: a note on multiword hashed tokens and collisions (fwd)
Posted by Sidney Markowitz <si...@sidney.com>.
Sidney Markowitz wrote:
> Yes, but they are using 4 bits less in the hash function
Typo. 40 - 32 = 8 bits, not four :-)
Re: a note on multiword hashed tokens and collisions (fwd)
Posted by Sidney Markowitz <si...@sidney.com>.
Justin Mason wrote:
> The bogofilter/CRM-114 forward was pretty clear that collisions in
> multiword token use caused FPs: 'the hash collisions quickly caused
> outrageously bad classification mistakes'.
Yes, but they are using 4 bits less in the hash function and multiplying
their number of tokens by 16 by generating multiword tokens. And there
is an exponential effect. My numbers show us getting something like 16
collisions (32 tokens) out of 4 million using 40 bits. The same
calculations show on the order of two million colliding tokens when you
use a 32 bit hash on a multiword database generated from a 4 million
single token base. No wonder they have a problem. They are ignoring the
proper use of hash functions.
-- sidney