Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/04/20 07:24:53 UTC

a note on multiword hashed tokens and collisions (fwd)

just found this when looking up details of "train on error" -- 
http://mail.python.org/pipermail/spambayes/2002-December/002440.html :

  > Are you hashing tokens?  spambayes does not, CRM114 does.  Bill
  > generates about 16 hash codes per input token, and with just a million
  > hash buckets, collision rates zoom quickly if you train on everything.

  Understood.  We don't hash tokens, and I agree that the sentence you
  quoted is misleading; I should have said something like "bogofilter's
  current tokenization and the R-F classification method."  I didn't try
  any of bogofilter's other calculation methods.

  > The experiments spambayes did with CRM114-like schemes were a disaster
  > due to this -- we continued to train on everything, with hashing but
  > without any bounds on bucket count, and the hash collisions quickly
  > caused outrageously bad classification mistakes. Removing the hashing
  > cured that, but then the database size goes through the roof (when
  > generating ~16 "exact strings" per input token, and training on
  > everything).

  Yup.
    
  > Training-on-error helps Bill because it slashes hash collisions,
  > simply via producing far fewer hash codes than does training on
  > everything.

  I didn't mean to imply otherwise, and your correction of my sloppy
  wording is appreciated.


So worth noting -- CRM-114 has had to adopt special training strategies
to avoid collisions when using hashed multiword tokens.
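
To make the scale of the problem concrete, here's a rough Perl sketch -- not
CRM114's actual code, just an illustration of a sparse window-of-5 scheme --
showing how each input word turns into roughly 16 multiword features, each of
which then needs its own hash slot:

  use strict;
  use warnings;
  use Digest::SHA1 qw(sha1);

  # Illustration only (not CRM114's real code): for each word, emit one
  # feature per subset of the 4 preceding words -- 2^4 = 16 features per
  # word -- and reduce each feature to a 32-bit hash value.  The feature
  # count, and hence the collision pressure, grows 16x faster than the
  # word count.
  my @words = qw(free offer click here now unsubscribe);
  my @hashes;
  for my $i (0 .. $#words) {
      for my $mask (0 .. 15) {
          my @combo;
          for my $k (reverse 1 .. 4) {
              push @combo,
                  (($i - $k >= 0) && ($mask & (1 << ($k - 1))))
                      ? $words[$i - $k]
                      : '<skip>';
          }
          my $feature = join ' ', @combo, $words[$i];
          push @hashes, unpack 'N', substr(sha1($feature), 0, 4);  # 32-bit hash
      }
  }
  printf "%d words -> %d hashed features\n", scalar @words, scalar @hashes;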

--j.

Re: a note on multiword hashed tokens and collisions (fwd)

Posted by Daniel Quinlan <qu...@pathname.com>.
Sidney Markowitz <si...@sidney.com> writes:

> But I think that my numbers show that 40-bit should be ok at 4 million 
> and certainly at 3 million.

I think it's perfectly acceptable to have a small number of collisions, and
that the size savings are a much more important factor.

Bear in mind, not only does a collision have to happen, but it has to
change the actual result *in the wrong direction* before we actually
start caring.

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

Re: a note on multiword hashed tokens and collisions (fwd)

Posted by Michael Parker <pa...@pobox.com>.
On Wed, Apr 21, 2004 at 10:15:13AM +1200, Sidney Markowitz wrote:
> Michael Parker wrote:
> >Question is, is using that value gonna work in the long run
> >for dbs with 3-4 million tokens?
> 
> substr(sha1($token), -5) and CHAR(5) is good to about 2 million using my
> no-brainer criterion of expecting no collisions at all.
> 

[ Snip Sidney's excellent analysis ]

Ahhh...ok it's much clearer now, thanks.

> 
> substr(sha1($token), -6) and CHAR(6) is no-brainer good to about 32 
> million tokens.
> 
> Can you try your benchmark with that? I expect that it would get you the 
> same performance as the 40 bit version and still enough reduction in db 
> size to be worthwhile.
> 

It should only really affect size (~5% greater than the CHAR(5)
version), so I didn't do a full benchmark.

> But I think that my numbers show that 40-bit should be ok at 4 million 
> and certainly at 3 million.

I think the overlap of a few tokens is fine and I'm gonna move forward
with the CHAR(5) version.

For SQL, we gain about a 40% space savings over the previous method using
MyISAM tables in MySQL (this includes switching to an int userid instead of
a username).
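
For the curious, here's a hypothetical sketch of the kind of layout that
implies -- table and column names are mine, not the actual SpamAssassin
schema -- with an int userid and the 5-byte hashed token as the key:

  use strict;
  use warnings;
  use DBI;

  # Hypothetical sketch only.  The point is the key layout: an integer
  # userid instead of a username string, and a CHAR(5) column holding
  # substr(sha1($token), -5) instead of the raw token text.
  my $dbh = DBI->connect('DBI:mysql:database=bayes', 'user', 'password',
                         { RaiseError => 1 });
  $dbh->do(q{
      CREATE TABLE bayes_token_hashed (
          userid     INT     NOT NULL,
          token      CHAR(5) NOT NULL,   -- 40-bit truncated SHA-1
          spam_count INT     NOT NULL DEFAULT 0,
          ham_count  INT     NOT NULL DEFAULT 0,
          PRIMARY KEY (userid, token)
      ) ENGINE=MyISAM
  });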

Michael

Re: a note on multiword hashed tokens and collisions (fwd)

Posted by Sidney Markowitz <si...@sidney.com>.
Michael Parker wrote:
> Question is, is using that value gonna work in the long run
> for dbs with 3-4 million tokens?

substr(sha1($token), -5) and CHAR(5) is good to about 2 million using my
no-brainer criterion of expecting no collisions at all.

Ok, time for me to do some math for the non-no-brainer case :-)

[looking things up... writing simulations in Lisp... ok, done]

When you get to 3 million tokens in the db you can expect 1 or 2 
collisions. At 4 million, 8 to 16 collisions. A collision means that two 
tokens which are different are treated as the same token.

So worst case, with a db of 4 million tokens, 32 tokens, grouped into 16
pairs, are being classified incorrectly. This can only make a difference
if one token of a colliding pair is a significant ham or spam sign and
the other is either not significant or is significant in the opposite way.

If you pick 150 tokens at random to evaluate a message, the probability 
of choosing one of the colliding tokens is 150 * (32 / 4000000) = 0.0012

That means that you have a 1 in a thousand chance of having one token 
out of the 150 contributing an inaccurate number to the calculation of 
the Bayes score of a message. 1 in a million that two of them will be 
wrong. Except that the probability is even lower, since the 150 most 
significant tokens are selected and there is a low probability that any 
of the 32 colliding tokens are ones that have a high significance.
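
For anyone who wants to check the arithmetic, here is the same back-of-envelope
estimate in a few lines of Perl (figures taken from the worst case above):

  # 32 colliding tokens in a 4-million-token db, 150 tokens used per message.
  my $p     = 32 / 4_000_000;                       # chance any one token collides
  my $p_one = 1 - (1 - $p) ** 150;                  # >= 1 of the 150 affected: ~0.0012
  my $p_two = $p_one - 150 * $p * (1 - $p) ** 149;  # >= 2 affected: ~7e-7 (~1 in a million)
  printf "P(>=1) = %.2g   P(>=2) = %.2g\n", $p_one, $p_two;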

Now if that chance of collision is too high for you, or you want to stick
to no-brainer numbers, you only need two more bits in the hash to double
the no-brainer limit from 2 million up to 4 million (the birthday limit
scales with the square root of the hash space, so doubling the token count
takes four times the space, which is two more bits). Of course we don't
have two bits available without going up another whole byte, which makes it:

substr(sha1($token), -6) and CHAR(6) is no-brainer good to about 32 
million tokens.
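
Both no-brainer limits are just the birthday bound of roughly 2 * sqrt(size
of the hash space); for reference:

  # Birthday bound: with a b-bit hash, collisions become likely once the
  # number of tokens approaches 2 * sqrt(2^b).
  for my $bits (40, 48) {
      my $limit = 2 * sqrt(2 ** $bits);
      printf "%2d-bit hash: ~%.1f million tokens\n", $bits, $limit / 1e6;
  }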

Can you try your benchmark with that? I expect that it would get you the 
same performance as the 40 bit version and still enough reduction in db 
size to be worthwhile.

But I think that my numbers show that 40-bit should be ok at 4 million 
and certainly at 3 million.

  -- sidney

Re: a note on multiword hashed tokens and collisions (fwd)

Posted by Michael Parker <pa...@pobox.com>.
I absolutely suck at math so I'm not even gonna think about it.  But
I've seen token databases with 3+ million tokens.

I've got bayes code running now using hashes, using Sidney's
substr(sha1($token), -5) value.  It provides a slight speedup (maybe
10-20%) on scanning.  With the corpus I'm using I see a ~7% decrease in
db size (for DBM; I haven't looked at SQL dbs yet), but my average token
size has been around 9 chars, so we're not exactly shrinking things a
ton.
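
For reference, the hashing itself is just a one-liner; a minimal sketch
(the helper name here is mine, assuming the Digest::SHA1 module):

  use Digest::SHA1 qw(sha1);

  # Illustrative helper: reduce a token of any length to a fixed 5-byte
  # (40-bit) key by keeping the last 5 bytes of its binary SHA-1 digest.
  sub token_key {
      my ($token) = @_;
      return substr(sha1($token), -5);
  }

  printf "key length: %d bytes\n", length(token_key('viagra'));   # 5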

Question is, is using that value gonna work in the long run for dbs
with 3-4 million tokens?

Michael

PS Interesting factoid, on my benchmark corpus (6000 ham and 6000
spam) we extract ~120k more tokens in 3.0.0 than we did in 2.63.

Re: a note on multiword hashed tokens and collisions (fwd)

Posted by Sidney Markowitz <si...@sidney.com>.
Justin Mason wrote:
> So worth noting -- CRM-114 has had to adopt special training strategies
> to avoid collisions when using hashed multiword tokens.

I looked up the details of CRM-114 tokenization and hashing. They 
generate 16 multiword tokens per parsed token (as it said in the piece 
you quoted) and the hash is only 32 bits. The birthday paradox says to 
expect collisions starting at about 2 times the square root of the size 
of the hash space, which for a 32-bit hash means once you have 2^17 
multiword tokens. That roughly compares to having 1/16 that many single 
tokens in a database, or 2^13.

Think what it would be like if we ran into collisions once we hit a 
little over 8,000 distinct tokens in the database. Of course CRM114 has 
problems with it.

This isn't rocket science. I don't know what those people are thinking. 
If you want to use a hash with a low probability of collisions, you use 
enough bits in the hash function. A good rule of thumb is at least twice 
as many bits in the hash as there are bits in the count of items you are 
hashing (i.e., twice log2 of the number of items). So the 40 bit hash I 
was using allows for a database of around a million unique tokens. It's 
my understanding that a large Bayes db would be under half a million. If 
we go to multiword hashes like CRM114 that multiply the number of tokens 
by 16, we should add eight bits to the hash function size. So the db 
would go from 5n bytes required for the token keys to (n * 16) * 6 = 96n 
bytes, which is 96/5 or almost 20 times bigger. Something to keep in 
mind if we consider multiword tokens.
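
Spelled out as a quick arithmetic check (using the half-million-token
estimate above):

  # Size of the key storage: n single tokens at 5 bytes (40-bit hash)
  # versus 16n multiword tokens at 6 bytes (48-bit hash).
  my $n         = 500_000;          # large Bayes db, per the estimate above
  my $single    = 5 * $n;           # bytes for single-token keys
  my $multiword = 6 * 16 * $n;      # bytes for multiword-token keys
  printf "multiword keys take %.1fx the space\n", $multiword / $single;   # 19.2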

  -- sidney