Posted to users@spamassassin.apache.org by Thomas Arend <ml...@arend-whv.info> on 2005/02/16 22:07:13 UTC

spam ham ratio for bayes filter

Hello,

many questions on this list are about the spam : ham ratio to train and how 
many mails should be trained. One continually repeated myth is the 1 : 1 
ratio.

I once read an article claiming that the best ratio is 1 : 1, supported by an 
experiment and later derived from Bayes' theorem. Unfortunately I didn't 
keep a copy and can't remember enough to find the article by googling.

The problem is that the article's conclusion was wrong.

What I will try to show in the next steps - which unfortunately require a 
little bit of algebra - is: train the Bayes filter in accordance with your real 
spam : ham ratio and train as much as possible. But never train too little 
ham, and never train only spam!

Here is my argument:

In short, Bayes' theorem says 

P(Spam|Token) = P(Token|Spam)*P(Spam)/P(Token) 

That means: the probability of a message being spam, under the condition that 
a token is in the message, 
is equal to 
the probability of the token being contained in a spam message 
multiplied by 
the probability of a message being spam 
divided by the probability of any message containing the token.

So if you have received s spam messages and h ham messages, where the token 
appears in S spam and H ham messages, then you get: 

s = number of spam messages
h = number of ham messages
S = number of spam messages containing the token
H = number of ham messages containing the token
s+h = number of messages
S+H = number of messages containing the token

Therefore 

	P(Spam) = s/(s+h) 

is an approximation of the probability of a random message being spam.

Likewise:
	P(Token) = (S+H)/(s+h)
	P(Token|Spam) = S/s


that leads to

	P(Spam|Token) = (S/s * s/(s+h)) / ((S+H)/(s+h))
	              = S / (S+H)

That means that the probability of a given message being spam when it 
contains a token is independent of the number of messages trained. 
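The derivation can be checked numerically with a short sketch (the counts used here are the hypothetical ones from the example below, nothing SpamAssassin-specific):

```python
# Sanity check: the full Bayes expression and the simplified
# S/(S+H) form give the same result.
def p_spam_given_token(s, h, S, H):
    """Full Bayes formula: P(Token|Spam) * P(Spam) / P(Token)."""
    p_spam = s / (s + h)                # P(Spam)
    p_token = (S + H) / (s + h)         # P(Token)
    p_token_given_spam = S / s          # P(Token|Spam)
    return p_token_given_spam * p_spam / p_token

def simplified(S, H):
    """The reduced form: P(Spam|Token) = S / (S + H)."""
    return S / (S + H)

print(p_spam_given_token(1000, 100, 100, 50))  # ~0.667
print(simplified(100, 50))                     # same value
```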

Let's say your real spam : ham ratio is 10 to 1 and your message corpus 
contains 1100 messages. Suppose 100 spam and 50 ham messages contain a certain 
token, say "vi@gr@". 

Total Messages: 1100
Spam (trained): 1000
Ham: 100
vi@gr@: in 100 spam and 50 ham 

If you train all messages, you will get a probability of 100 / (100+50) = 66.6% 
that the next message containing the token is spam. That isn't a high 
probability, but it works fine for this example.

If you train only 10% of your spam to get a spam : ham ratio of 1:1, you will 
presumably count only 10 spam messages with the token. 

Spam (trained): 100
Ham: 100
vi@gr@: in 10 (=10% of 100) spam and 50 ham 

Which leads to a spam probability of only 10 / (10+50) = 16.6%, 
which is a little bit low.
 
What happens when you train too little ham? 

Let's assume you train only 50% of your ham but all of your spam. You will 
presumably count only 25 ham messages with the token. 

Spam (trained): 1000
Ham (50% trained): 50
vi@gr@: in 100 spam and 25 (= 50% of 50) ham 

Which leads to a spam probability of 100 /(100+25) = 80%.
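The three scenarios above can be reproduced with the reduced formula (a minimal sketch; the counts are the hypothetical ones from the text):

```python
def p_spam(S, H):
    # P(Spam|Token) = S / (S + H), with S spam and H ham messages
    # containing the token in the trained corpus.
    return S / (S + H)

print(p_spam(100, 50))   # all mail trained: ~0.667
print(p_spam(10, 50))    # only 10% of spam trained: ~0.167 (more false negatives)
print(p_spam(100, 25))   # only 50% of ham trained: 0.800 (more false positives)
```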

What happens when the ratio goes the other way, i.e. your ham : spam ratio is 10 to 1?

Ham = 1000
Spam = 100
vi@gr@: in 100 ham and 50 spam 

=> 50 / (50+100) = 33.3%

Ham (10% trained) = 100
Spam = 100
vi@gr@: in 10 (=10% of 100) ham and 50 spam 

=> 50 / (50+10) = 83.3%

Oops!

So if you train too little spam you will get a higher false negative rate; if 
you train too little ham you will get a higher false positive rate.

Because a false positive is more harmful than a false negative, my conclusion 
is:
	train in accordance with your real spam : ham ratio, train as much as 
	possible (= train all messages), but never train too little ham, and 
	never train only spam!

(BTW: the risk of false positives is the reason why Paul Graham multiplied 
his ham token counts by 2.)
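A sketch of that bias (as described in Graham's "A Plan for Spam", where ham occurrence counts are doubled before computing the probability; the function name and counts here are illustrative):

```python
def p_spam_biased(S, H, ham_weight=2):
    # Graham-style bias: weight ham token counts to push borderline
    # tokens toward ham and reduce the false-positive risk.
    return S / (S + ham_weight * H)

print(p_spam_biased(100, 50))  # 100 / (100 + 2*50) = 0.5, down from ~0.667
```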

Another lesson should be: Never train whitelisted mails as ham!!!

 
Best regards 


Thomas Arend

PS: I hope I made no mistakes.
-- 
icq:133073900
http://www.t-arend.de

Re: spam ham ratio for bayes filter

Posted by Thomas Arend <ml...@arend-whv.info>.
Hello Thomas,

On Thursday, 17 February 2005 19:17, Thomas Bolioli wrote:
> Interesting but what happens in the case where someone, like me, is
> getting 250+ spam a day and only about ten or so legitimate emails? This
> is not counting this account that my mailing lists go to which I have
> far better bayes performance on (1:100 spam/ham ratio instead of 10:1 or
> lower with my other accounts). With autotraining turned on, that means
> far more spam will get trained.

Yes. 

> Even if I turned off auto training, and 
> trained only the ham that came through, it would simply allow changes in
> spam to begin to defeat the bayes filter over time, is that not so?

Yes.

You must train both ham and spam frequently to keep up with small changes in 
the mails. Bayes needs to know which tokens are in ham and which are in spam. 
For the filter it doesn't matter what you call ham or spam; it just collects 
information about the two classes 'ham' and 'spam' and decides on the 
statistical data to which class a new message probably belongs. For the filter 
it does not matter whether you have a high spam-to-ham or a high ham-to-spam 
ratio.  

Bayesian filtering is done on tokens seen before. If you don't train spam, you 
will spoil your filter, because it doesn't learn new tokens. 

If you train 1 : 1, the filter assumes that 50% of your mail is ham and 50% 
is spam. In reality it may be that 96% is spam.

What happens when your spam : ham ratio is 100 to 1?
This is an extreme example:

Ham = 100
Spam = 10000
vi@gr@: in 50 ham and 100 spam

=> 100 / (50+100) = 66.7%

Every second ham message contains the token, and one in 100 spam messages 
does. That means if you get a message with the token, it is spam in 2 of 3 
cases! That is what Bayesian filtering says: I got a message with a particular 
token, and historically it was spam in 2 of 3 cases.

What happens when you train only 100 spam messages to get the ratio 1:1?

Ham (100% trained) = 100
Spam (1% trained) = 100
vi@gr@: in 50 ham and 1 (= 1%) spam (we were lucky and got one message with 
the token)

=> 1 / (50+1) = 1.96%

The Bayesian filter will now say that the message is spam with a probability 
of about 2% and ham with about 98%. The filter is useless; it declares 
everything as ham. If the ratio is skewed the other way, it declares 
everything as spam.
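The collapse can be reproduced with the same reduced formula as before (a minimal sketch using the hypothetical counts from this example):

```python
def p_spam(S, H):
    # P(Spam|Token) = S / (S + H)
    return S / (S + H)

print(p_spam(100, 50))  # all 10000 spam trained: token indicates spam, p ~ 0.667
print(p_spam(1, 50))    # only 1% of spam trained: p ~ 0.02, token now "proves" ham
```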

> Doesn't that mean that the expiration system that SA employs solves that
> problem?

No. Expiration only reduces the size of the database. It drops unused tokens 
which haven't appeared in a message for a long time. So if you don't train 
spam, you will eventually have no spammy tokens anymore, which spoils the 
filter.

Regards

Thomas Arend

[..]
-- 
icq:133073900
http://www.t-arend.de

Re: spam ham ratio for bayes filter

Posted by Thomas Bolioli <tp...@terranovum.com>.
Interesting but what happens in the case where someone, like me, is 
getting 250+ spam a day and only about ten or so legitimate emails? This 
is not counting this account that my mailing lists go to which I have 
far better bayes performance on (1:100 spam/ham ratio instead of 10:1 or 
lower with my other accounts). With autotraining turned on, that means 
far more spam will get trained. Even if I turned off auto training, and 
trained only the ham that came through, it would simply allow changes in 
spam to begin to defeat the bayes filter over time, is that not so? 
Doesn't that mean that the expiration system that SA employs solves that 
problem?
Tom

Thomas Arend wrote:

>[..]