Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/04/20 07:24:53 UTC
a note on multiword hashed tokens and collisions (fwd)
just found this when looking up details of "train on error" --
http://mail.python.org/pipermail/spambayes/2002-December/002440.html :
> Are you hashing tokens? spambayes does not, CRM114 does. Bill
> generates about 16 hash codes per input token, and with just a million
> hash buckets, collision rates zoom quickly if you train on everything.
Understood. We don't hash tokens, and I agree that the sentence you
quoted is misleading; I should have said something like "bogofilter's
current tokenization and the R-F classification method." I didn't try
any of bogofilter's other calculation methods.
> The experiments spambayes did with CRM114-like schemes were a disaster
> due to this -- we continued to train on everything, with hashing but
> without any bounds on bucket count, and the hash collisions quickly
> caused outrageously bad classification mistakes. Removing the hashing
> cured that, but then the database size goes through the roof (when
> generating ~16 "exact strings" per input token, and training on
> everything).
Yup.
> Training-on-error helps Bill because it slashes hash collisions,
> simply via producing far fewer hash codes than does training on
> everything.
I didn't mean to imply otherwise, and your correction of my sloppy
wording is appreciated.
So worth noting -- CRM-114 has had to adopt special training strategies
to avoid collisions when using hashed multiword tokens.
--j.
Re: a note on multiword hashed tokens and collisions (fwd)
Posted by Daniel Quinlan <qu...@pathname.com>.
Sidney Markowitz <si...@sidney.com> writes:
> But I think that my numbers show that 40-bit should be ok at 4 million
> and certainly at 3 million.
I think it's perfectly acceptable to have a small number of collisions
and that the size savings are a much more important factor.
Bear in mind, not only does a collision have to happen, but it has to
change the actual result *in the wrong direction* before we actually
start caring.
Daniel
--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: a note on multiword hashed tokens and collisions (fwd)
Posted by Michael Parker <pa...@pobox.com>.
On Wed, Apr 21, 2004 at 10:15:13AM +1200, Sidney Markowitz wrote:
> Michael Parker wrote:
> >Question is, is using that value gonna work in the long run
> >for dbs with 3-4 million tokens?
>
> substr(sha1($token), -5) and CHAR(5) is good to about 2 million using my
> no-brainer criteria of expecting no collisions at all.
>
[ Snip Sidney's excellent analysis ]
Ahhh...ok it's much clearer now, thanks.
>
> substr(sha1($token), -6) and CHAR(6) is no-brainer good to about 32
> million tokens.
>
> Can you try your benchmark with that? I expect that it would get you the
> same performance as the 40 bit version and still enough reduction in db
> size to be worthwhile.
>
It should only really affect size (~5% greater than the CHAR(5)
version), so I didn't do a full benchmark.
> But I think that my numbers show that 40-bit should be ok at 4 million
> and certainly at 3 million.
I think the overlap of a few tokens is fine and I'm gonna move forward
with the CHAR(5) version.
For SQL we gain a 40% space savings over the previous method (this
includes switching to int userid instead of username) using MyISAM
tables in MySQL.
Michael
Re: a note on multiword hashed tokens and collisions (fwd)
Posted by Sidney Markowitz <si...@sidney.com>.
Michael Parker wrote:
> Question is, is using that value gonna work in the long run
> for dbs with 3-4 million tokens?
substr(sha1($token), -5) and CHAR(5) is good to about 2 million using my
no-brainer criteria of expecting no collisions at all.
Ok, time for me to do some math for the non-no-brainer case :-)
[looking things up... writing simulations in Lisp... ok, done]
When you get to 3 million tokens in the db you can expect 1 or 2
collisions. At 4 million, 8 to 16 collisions. A collision means that two
tokens which are different are treated as the same token.
So worst case, with a db of 4 million tokens, 32 tokens, grouped in 16
pairs, are being classified incorrectly. This can only make a difference
if one token of a colliding pair is a significant ham or spam sign and
the other is either not significant or is significant in the opposite way.
If you pick 150 tokens at random to evaluate a message, the probability
of choosing one of the colliding tokens is 150 * (32 / 4000000) = 0.0012
That means that you have a 1 in a thousand chance of having one token
out of the 150 contributing an inaccurate number to the calculation of
the Bayes score of a message. 1 in a million that two of them will be
wrong. Except that the probability is even lower, since the 150 most
significant tokens are selected and there is a low probability that any
of the 32 colliding tokens are ones that have a high significance.
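Sidney's estimate above can be checked with a few lines of arithmetic. This is a sketch in Python (the thread's actual code is Perl); the variable names are mine:

```python
# Probability that at least one of the 150 tokens selected to score a
# message is one of the 32 colliding tokens, out of 4 million in the db.
tokens_in_db = 4_000_000
colliding_tokens = 32
tokens_per_message = 150

p_one_bad = tokens_per_message * (colliding_tokens / tokens_in_db)
print(p_one_bad)        # about 0.0012 -- roughly 1 in a thousand
print(p_one_bad ** 2)   # about 1.4e-06 -- roughly 1 in a million for two
```

This treats the 150 tokens as independent draws, which slightly overstates the risk; as Sidney notes, the real figure is lower still because the 150 are chosen by significance, not at random.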
Now if that chance of collision is too high for you or you want to stick
to no-brainer numbers, you only need two more bits in the hash to double
the no-brainer limit of 2 million up to 4 million. Of course we don't
have two bits available without going up another whole byte, which makes it:
substr(sha1($token), -6) and CHAR(6) is no-brainer good to about 32
million tokens.
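The "no-brainer" limits quoted in this thread appear to follow the usual birthday-paradox threshold of twice the square root of the hash space. A quick sketch (the function name is mine):

```python
import math

def no_brainer_limit(hash_bits):
    """Token count below which collisions are not yet expected,
    using the 2 * sqrt(hash space) birthday-paradox threshold."""
    return 2 * math.isqrt(2 ** hash_bits)

# CHAR(5) column = 40-bit hash, CHAR(6) column = 48-bit hash
print(no_brainer_limit(40))   # 2097152 -- about 2 million tokens
print(no_brainer_limit(48))   # 33554432 -- about 32 million tokens
```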
Can you try your benchmark with that? I expect that it would get you the
same performance as the 40 bit version and still enough reduction in db
size to be worthwhile.
But I think that my numbers show that 40-bit should be ok at 4 million
and certainly at 3 million.
-- sidney
Re: a note on multiword hashed tokens and collisions (fwd)
Posted by Michael Parker <pa...@pobox.com>.
I absolutely suck at math so I'm not even gonna think about it. But
I've seen token databases with 3+ million tokens.
I've got bayes code running now using hashes, using Sidney's
substr(sha1($token), -5) value. It provides a slight speedup (maybe
10-20%) on scanning. With the corpus I'm using I see ~7% decrease in
db size (for DBM, I haven't looked at SQL dbs yet), but my average token
size has been around 9 chars so we're not exactly shrinking things a
ton.
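For reference, substr(sha1($token), -5) keeps the trailing 5 bytes (40 bits) of the binary SHA-1 digest. A rough Python equivalent, with hashlib standing in for Perl's Digest::SHA1:

```python
import hashlib

def token_hash(token, nbytes=5):
    """Keep the trailing nbytes of the binary SHA-1 digest,
    mirroring Perl's substr(sha1($token), -$nbytes)."""
    return hashlib.sha1(token.encode("utf-8")).digest()[-nbytes:]

h = token_hash("viagra")
print(len(h))   # 5 bytes = 40 bits
```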
Question is, is using that value gonna work in the long run for dbs
with 3-4 million tokens?
Michael
PS Interesting factoid, on my benchmark corpus (6000 ham and 6000
spam) we extract ~120k more tokens in 3.0.0 than we did in 2.63.
Re: a note on multiword hashed tokens and collisions (fwd)
Posted by Sidney Markowitz <si...@sidney.com>.
Justin Mason wrote:
> So worth noting -- CRM-114 has had to adopt special training strategies
> to avoid collisions when using hashed multiword tokens.
I looked up the details of CRM-114 tokenization and hashing. They
generate 16 multiword tokens per parsed token (as it said in the piece
you quoted) and the hash is only 32 bits. Birthday paradox says to
expect collisions starting at 2 times the square root of the hash space,
which means once you have 2^17 multiword tokens. That roughly compares
to having 1/16 that many single tokens in a database, or 2^13.
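The arithmetic above is easy to verify; a minimal sketch (variable names are mine, the figures are from Sidney's post):

```python
import math

hash_bits = 32              # CRM-114's hash width
multiword_per_token = 16    # multiword tokens generated per parsed token

# Birthday paradox: expect collisions around 2 * sqrt(hash space).
collision_threshold = 2 * math.isqrt(2 ** hash_bits)   # 2**17 multiword tokens
single_token_equiv = collision_threshold // multiword_per_token

print(collision_threshold)   # 131072, i.e. 2**17
print(single_token_equiv)    # 8192, i.e. 2**13 parsed tokens
```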
Think what it would be like if we ran into collisions once we hit a
little over 8,000 distinct tokens in the database. Of course CRM114 has
problems with it.
This isn't rocket science. I don't know what those people are thinking.
If you want to use a hash with low probability of collisions, you use
enough bits in the hash function. A good rule of thumb is at least twice
as many bits in the hash as there are bits in the count of items you are
hashing. So the 40 bit hash I was using allows for a database of around
a million unique tokens. It's my understanding that a large Bayes db
would be under half a million. If we go to multiword hashes like CRM114
that multiply the number of tokens by 16 we should add eight bits to the
hash function size. So the db would go from 5n bytes required for the
tokens to (n * 16) * 6 = 96n, so would be 96/5 or almost 20 times
bigger. Something to keep in mind if we consider multiword tokens.
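The size projection in that paragraph works out as follows; a back-of-envelope sketch, with the byte counts taken from the post:

```python
# Database growth if we moved from 40-bit single-token hashes to
# CRM114-style multiword hashing.
single_hash_bytes = 5    # current CHAR(5) hash: 5n bytes total
multiword_factor = 16    # multiword tokens per parsed token
# 16x more items is 4 extra bits of item count, so the "twice the bits"
# rule asks for 8 more hash bits, i.e. one more byte: CHAR(6).
multiword_hash_bytes = 6

bytes_per_token_now = single_hash_bytes                          # 5n
bytes_per_token_multi = multiword_factor * multiword_hash_bytes  # 96n
print(bytes_per_token_multi / bytes_per_token_now)   # 19.2, almost 20x
```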
-- sidney