Posted to users@spamassassin.apache.org by Nicki Messerschmidt <sp...@alienn.net> on 2004/08/04 21:58:02 UTC

Request for information about bayesian filter

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi list,
I'm seeking information about Bayesian filters. I'm using SpamAssassin
on our mail server with auto learning.
Now I was asked whether a very uneven ham/spam ratio of 1:10 harms the
filtering done by the Bayesian database.

Does anyone have more information and/or experience on this subject?


Cheers and thanks
Nicki

- --
Linksystem Muenchen GmbH                          info@link-m.de
Schloerstrasse 10                           http://www.link-m.de
80634 Muenchen                              Tel. 089 / 890 518-0
We make the Net work.                       Fax 089 / 890 518-77
PGP Keys:                             https://www.link-m.de/pgp/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1-nr1 (Windows 2000)
Comment: Get keys at: https://www.link-m.de/pgp

iD8DBQFBET/K6zWc+bXuIEMRAmziAJ9MPeATf1/NWZy50ZLE7m/JbqjN5QCgsC8l
SGw2ph5tCW9ZH76umTmnBkE=
=id0J
-----END PGP SIGNATURE-----

Re: Request for information about bayesian filter

Posted by Matt Kettler <mk...@evi-inc.com>.
At 08:43 AM 8/5/2004, Nicki Messerschmidt wrote:
>Do you have any information about how and when SA3 expires information
>from the bayesian database?
>I'd like to "preinstall" a pretrained database for each user and hope
>that the database is not emptied immediately if I add a user in three
>months with the database from today.

I don't know for sure, but I imagine the expiry is similar to what's used 
in SA 2.5x and 2.6x. It may have some differences in the exact details, but 
from a high-level perspective it should be similar.

For reference, the method is:

When the number of tokens in the database exceeds bayes_expiry_max_db_size, 
expire the least recently used tokens until only 75% of 
bayes_expiry_max_db_size tokens are left. Never expire down to fewer than 
100,000 tokens, regardless of what the max_db_size is.

Thus you shouldn't run into problems. SA will never completely empty the 
bayes DB via expiry.
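
The rule described above can be sketched in a few lines of Python. This is only an illustration of the description in this post, not SpamAssassin's actual expiry code; the function name, the dict layout (token mapped to last-access time) and the two constants are assumptions taken from the numbers quoted above.

```python
MIN_TOKENS_AFTER_EXPIRY = 100_000  # floor quoted in the post above
KEEP_FRACTION = 0.75               # shrink target quoted in the post above


def expire_tokens(tokens, max_db_size):
    """Drop least recently used tokens once the DB exceeds max_db_size.

    tokens: dict mapping token -> last-access time (hypothetical layout).
    Returns a new dict; the input is left untouched.
    """
    if len(tokens) <= max_db_size:
        return dict(tokens)  # at or under the limit: nothing expires

    # Shrink to 75% of the limit, but never below the 100,000-token floor.
    target = max(int(max_db_size * KEEP_FRACTION), MIN_TOKENS_AFTER_EXPIRY)

    # Sort newest first and keep only the `target` most recently used tokens.
    newest_first = sorted(tokens.items(), key=lambda kv: kv[1], reverse=True)
    return dict(newest_first[:target])
```

Under this sketch a database can shrink but never becomes empty through expiry, which matches the conclusion above.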



Re: Request for information about bayesian filter

Posted by Nicki Messerschmidt <sp...@alienn.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Kettler said the following:

> At 03:58 PM 8/4/2004, Nicki Messerschmidt wrote:
>> I'm seeking information about bayesian filters. I'm using
>> spamassassin on our mail server with auto learning. Now I was
>> asked if a very uneven ham/spam ratio of 1:10 does harm the
>> filtering done by the bayesian database.
>> Has anyone of you more information and/or experience on this
>> subject?
> And in general bayes is pretty resilient to gross deviations from
> the "perfect" ratio. My training ratio is coming in at about 1:26.
> My real-world inbound ratio seems to be about 1:10 or so, thus I'm
> even further over than that. I'm not having any problems so far.
>
> The only situation you might run into is if you're severely
> undertraining ham and overtraining spam, bayes poisoning might
> start making nonspam emails score higher in the BAYES_ ranks.

Do you have any information about how and when SA3 expires information
from the bayesian database?
I'd like to "preinstall" a pretrained database for each user and hope
that the database is not emptied immediately if I add a user in three
months with the database from today.


Cheers and thanks
Nicki

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1-nr1 (Windows 2000)
Comment: Get keys at: https://www.link-m.de/pgp

iD8DBQFBEite6zWc+bXuIEMRAvQ0AJ9ZSRmb0+M98n/xA6WogC3iOCGb+gCfSiAG
e0BvIeo56268ru0NheC3TVI=
=6Nh+
-----END PGP SIGNATURE-----

Re: Request for information about bayesian filter

Posted by Jim Maul <jm...@elih.org>.
Quoting Matt Kettler <mk...@evi-inc.com>:

> At 03:58 PM 8/4/2004, Nicki Messerschmidt wrote:
>> I'm seeking information about bayesian filters. I'm using spamassassin
>> on our mail server with auto learning.
>> Now I was asked if a very uneven ham/spam ratio of 1:10 does harm the
>> filtering done by the bayesian database.
>>
>> Has anyone of you more information and/or experience on this subject?
>
> First, IMO, it's a *complete* misconception that a "perfect" bayes database
> should be trained with a 1:1 ratio. That's complete nonsense; discard
> such garbage from your mind at once. Bayes is a statistical system.
> Statistical systems work best when given REALISTIC input. Thus the
> "perfect" ratio isn't 1:1, it's whatever your real-world ham:spam ratio is.
> And I don't know about your network, but on mine, inbound spam outnumbers
> inbound ham by quite a lot.
>
> And in general bayes is pretty resilient to gross deviations from the
> "perfect" ratio. My training ratio is coming in at about 1:26. My
> real-world inbound ratio seems to be about 1:10 or so, thus I'm even
> further over than that. I'm not having any problems so far.
>
> The only situation you might run into is if you're severely undertraining
> ham and overtraining spam, bayes poisoning might start making nonspam
> emails score higher in the BAYES_ ranks.

I agree with this completely.  Although my numbers are drastically different,
my bayes scores are still working great:

0.000          0        482          0  non-token data: nspam
0.000          0      14325          0  non-token data: nham

My system tends to have about a 25:1 ham:spam ratio.

Jim
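
For anyone who wants to compute the training ratio from counts like the two lines quoted above, here is a small sketch. It assumes the layout shown in this message, with the count in the third numeric column before the "non-token data:" label; the function name is made up for illustration.

```python
def training_counts(dump_text):
    """Extract (nspam, nham) from 'non-token data' lines."""
    counts = {}
    for line in dump_text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, _, name = line.partition("non-token data:")
        # Assumption: the count is the third whitespace-separated column.
        counts[name.strip()] = int(fields.split()[2])
    return counts["nspam"], counts["nham"]


sample = """\
0.000          0        482          0  non-token data: nspam
0.000          0      14325          0  non-token data: nham
"""
nspam, nham = training_counts(sample)
print(f"ham:spam = {nham / nspam:.1f}:1")  # prints "ham:spam = 29.7:1"
```

With the counts above, the trained ham:spam ratio comes out near 30:1, a little above the roughly 25:1 mail ratio mentioned in the message.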

Re: Request for information about bayesian filter

Posted by Matt Kettler <mk...@evi-inc.com>.
At 06:10 AM 8/5/2004, Nicki Messerschmidt wrote:
>And the question is, when does the poisoning begin?
>Does anyone have reliable information about the approximate ham:spam
>ratio at which poisoning would take place?

That is a function of both the ratio AND the spam itself.

Really, I think for "pure" spam and ham, you could have a ratio of 10,000:1 
and be fine.

The problem isn't so much self poisoning, as it is weakening yourself to 
intentional poisoning on the part of the spammer. If you have a heavily 
off-balance training ratio and a lot of spam containing intentional bayes 
poison, you can run into FP problems on the ham side because the poison 
tokens are going to start drowning everything out. Conversely if your ratio 
is heavily off-balance towards the ham side, spam containing poison will be 
more likely to evade the bayes filter.

Effectively this is a function of the tokens, not the emails, so it's a 
function of about 100,000 variables, thus it'd be hard to boil it down to 
anything as simple as a "dangerous ratio".
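
To make the per-token point concrete, here is a toy calculation. It uses a simplified Graham-style per-token estimate, which is an assumption chosen for illustration and not SpamAssassin's actual Bayes math; all counts are invented.

```python
def token_spam_probability(spam_hits, ham_hits, nspam, nham):
    """Simplified per-token estimate: spam frequency over the sum of frequencies."""
    spam_freq = spam_hits / nspam
    ham_freq = ham_hits / nham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_freq / (spam_freq + ham_freq)


# A "poison" word that spammers paste into 1% of their mail (10 of 1000 spams)
# and that really occurs in about 2% of legitimate mail.

# Well-trained ham side: 1000 hams, 20 of them contain the word.
well_trained = token_spam_probability(10, 20, nspam=1000, nham=1000)

# Undertrained ham side: only 10 hams learned, none happened to contain it.
undertrained = token_spam_probability(10, 0, nspam=1000, nham=10)

print(round(well_trained, 3), undertrained)  # prints "0.333 1.0"
```

In this toy model the same word looks hammy when ham is well covered and purely spammy when ham is undertrained, because the ham side has never seen it. Whether that matters in practice depends on which tokens the spam actually carries, which is why a single "dangerous ratio" is hard to state.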

I suppose you could do a measurement for a given pile of spam and ham, but 
since spam constantly changes its behavior, the "danger" level is going to 
change constantly as well.

My ballpark guess, based on my experience, is that a bayes DB with decent 
volume of training (at least 100 emails a day) would likely start to have 
noticeable bayes misclassification problems somewhere near spam:ham ratios 
of 100:1 or 1:50.


Re: Request for information about bayesian filter

Posted by Nicki Messerschmidt <sp...@alienn.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Kettler said the following:

> At 03:58 PM 8/4/2004, Nicki Messerschmidt wrote:
>> I'm seeking information about bayesian filters. I'm using
>> spamassassin on our mail server with auto learning. Now I was
>> asked if a very uneven ham/spam ratio of 1:10 does harm the
>> filtering done by the bayesian database.
>> Has anyone of you more information and/or experience on this
>> subject?
> The only situation you might run into is if you're severely
> undertraining ham and overtraining spam, bayes poisoning might
> start making nonspam emails score higher in the BAYES_ ranks.

And the question is, when does the poisoning begin?
Does anyone have reliable information about the approximate ham:spam
ratio at which poisoning would take place?


Cheers and thanks for the replies so far
Nicki

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1-nr1 (Windows 2000)
Comment: Get keys at: https://www.link-m.de/pgp

iD8DBQFBEgeW6zWc+bXuIEMRAgVaAKD21DsqLy2k4qUVOBf9c2pWyqkoewCgtZAS
9dKmaRrGAFTmRBv2Jtd3tdQ=
=9SFL
-----END PGP SIGNATURE-----

Re: Request for information about bayesian filter

Posted by Matt Kettler <mk...@evi-inc.com>.
At 03:58 PM 8/4/2004, Nicki Messerschmidt wrote:
>I'm seeking information about bayesian filters. I'm using spamassassin
>on our mail server with auto learning.
>Now I was asked if a very uneven ham/spam ratio of 1:10 does harm the
>filtering done by the bayesian database.
>
>Has anyone of you more information and/or experience on this subject?

First, IMO, it's a *complete* misconception that a "perfect" bayes database 
should be trained with a 1:1 ratio. That's complete nonsense; discard 
such garbage from your mind at once. Bayes is a statistical system. 
Statistical systems work best when given REALISTIC input. Thus the 
"perfect" ratio isn't 1:1, it's whatever your real-world ham:spam ratio is. 
And I don't know about your network, but on mine, inbound spam outnumbers 
inbound ham by quite a lot.

And in general bayes is pretty resilient to gross deviations from the 
"perfect" ratio. My training ratio is coming in at about 1:26. My 
real-world inbound ratio seems to be about 1:10 or so, thus I'm even 
further over than that. I'm not having any problems so far.

The only situation you might run into is if you're severely undertraining 
ham and overtraining spam, bayes poisoning might start making nonspam 
emails score higher in the BAYES_ ranks.




Re: Request for information about bayesian filter

Posted by Robert Menschel <Ro...@Menschel.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Nicki,

Wednesday, August 4, 2004, 12:58:02 PM, you wrote:

NM> I'm seeking information about bayesian filters. I'm using
NM> spamassassin on our mail server with auto learning. Now I was asked if
NM> a very uneven ham/spam ratio of 1:10 does harm the filtering done by
NM> the bayesian database.   

NM> Has anyone of you more information and/or experience on this subject?

My ratio is pretty close to 1:10, and Bayes works wonders here.

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0.3

iQA/AwUBQRHI5JebK8E4qh1HEQKxHQCg6lU+HPMXjhH7mNYhxYBy2TE+enkAoJxw
/mVYiQ+llZEDCgNNMGEFZNt6
=5Sy4
-----END PGP SIGNATURE-----