Posted to users@spamassassin.apache.org by Arthur Kerpician <ar...@bluechip.ro> on 2009/04/09 08:41:22 UTC

bayes learn best practice

Hi,
I recently upgraded to 3.2.5 and re-trained bayes db from scratch. The 
auto-learn is on so now I have about 6000 mails trained as spam and 3000 
as ham. I tried to manually keep both spam and ham at the same level in 
the bayes db but it seems that spamassassin is learning spam twice as 
fast as ham. The docs mention that after 5000 spam and ham learned, 
spamassassin doesn't improve spam detection much. What is the best 
practice to optimize the bayes detection? Should I stop auto-learning 
after reaching the 5000 mark and then re-train from time to time from 
scratch?

Thanks,
Arthur

Re: bayes learn best practice

Posted by Michael Scheidell <sc...@secnap.net>.

Arthur Kerpician wrote:
>
>
> I was thinking to increase bayes_auto_learn_threshold_spam to a higher 
> number, so less spam is auto-learned. Is this ok?

I try to keep it at a 10 to 1 ratio (dropping the _ham threshold and 
increasing the _spam threshold), basically, trying to mimic the global 
stats of a 10 to 1 spam/ham ratio.

No real reason for that, just superstition I guess!
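
To make the effect of moving the thresholds concrete, here is a minimal sketch of the auto-learn decision. It is a simplification under stated assumptions, not SpamAssassin's actual code: real auto-learning applies further conditions, and the sample scores are invented. The two default values are those documented for bayes_auto_learn_threshold_nonspam (the "_ham" threshold mentioned above) and bayes_auto_learn_threshold_spam; the function names are hypothetical.

```python
# Documented defaults for the two auto-learn thresholds.
DEFAULT_THRESHOLD_NONSPAM = 0.1
DEFAULT_THRESHOLD_SPAM = 12.0


def autolearn_decision(score,
                       threshold_nonspam=DEFAULT_THRESHOLD_NONSPAM,
                       threshold_spam=DEFAULT_THRESHOLD_SPAM):
    """Simplified model: return "ham", "spam", or None (not learned)."""
    if score < threshold_nonspam:
        return "ham"
    if score >= threshold_spam:
        return "spam"
    return None


def count_learned(scores, **thresholds):
    """Count how many messages would be learned as ham and as spam."""
    counts = {"ham": 0, "spam": 0}
    for score in scores:
        decision = autolearn_decision(score, **thresholds)
        if decision is not None:
            counts[decision] += 1
    return counts


# Invented scores, for illustration only.
SAMPLE_SCORES = [-1.0, 0.0, 3.5, 8.0, 13.0, 16.0, 22.0, 30.0]

print(count_learned(SAMPLE_SCORES))
print(count_learned(SAMPLE_SCORES, threshold_spam=20.0))
```

Raising the spam threshold only shrinks the set of messages learned as spam; it does nothing to the messages in the middle band, which are never auto-learned either way.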

-- 
Michael Scheidell, CTO
Phone: 561-999-5000, x 1259

Re: bayes learn best practice

Posted by John Hardin <jh...@impsec.org>.

On Tue, 14 Apr 2009, Arthur Kerpician wrote:

> Again, if I choose to learn *all* spam and *all* ham, I'll end up with 
> big differences between their levels in bayes, which will affect spam 
> detection.

Not really, when you consider that the volume of spam you receive far 
exceeds the volume of ham you receive.

> Anyway, in the meantime I stopped auto-learning and I'm manually 
> feeding missed spam. For every spam message fed, I train at least 1 ham. 
> So, this should keep the bayes db optimized.

While that won't hurt, I think you're worrying too much about it. If it 
bothers you that much, turn off autolearning and concentrate on training 
FPs and FNs.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79

Re: bayes learn best practice

Posted by Arthur Kerpician <ar...@bluechip.ro>.

Kai Schaetzl wrote:
> Arthur Kerpician wrote on Thu, 09 Apr 2009 20:25:42 +0300:
>
>   
>> . So from time to time I should 
>> feed ham manually to sa-learn, until it reaches the spam level again. Is 
>> this correct? If it is, I think it's rather time-consuming to always 
>> check the trained ham/spam and level them.
>>     
>
> There is no reason to do this and nobody told you to do so :-)
> Wherever you read this, you either misunderstood it or it was wrong.
> You can manually train ham *and* spam if you like. It's good to train all 
> the stuff that got missed or wasn't autolearned. But you have to be sure 
> it's learned for the "right side".
>   
I was talking about the fact that, over time, the amounts of spam and ham 
auto-learned in bayes are going to be very different. In practice I see 
that for every ham auto-learned there are 5-6 spams auto-learned. So, over 
time, learned spam will be 5-6 times the learned ham. As the manual explains, 
such big differences between the spam / ham levels trained for bayes 
will be a huge drawback in spam detection. This was the context in which 
I asked how I should keep both spam / ham levels even. And my own answer 
was to manually feed bayes with ham until it reaches the learned spam 
level. If I keep auto-learning running, the spam tokens will outnumber 
the ham tokens.
>   
>> I was thinking to increase bayes_auto_learn_threshold_spam to a higher 
>> number, so less spam is auto-learned. Is this ok?
>>     
>
> This would be nonsense. In theory you want to learn *all* ham and *all* 
> spam. As you obviously can't do this you learn *as much as possible*, 
> within the constraints of your operation.
>   
Again, if I choose to learn *all* spam and *all* ham, I'll end up with 
big differences between their levels in bayes, which will affect spam 
detection.

Anyway, in the meantime I stopped auto-learning and I'm manually 
feeding missed spam. For every spam message fed, I train at least 1 ham. 
So, this should keep the bayes db optimized.

Re: bayes learn best practice

Posted by Kai Schaetzl <ma...@conactive.com>.

Arthur Kerpician wrote on Thu, 09 Apr 2009 20:25:42 +0300:

> . So from time to time I should 
> feed ham manually to sa-learn, until it reaches the spam level again. Is 
> this correct? If it is, I think it's rather time-consuming to always 
> check the trained ham/spam and level them.

There is no reason to do this and nobody told you to do so :-)
Wherever you read this, you either misunderstood it or it was wrong.
You can manually train ham *and* spam if you like. It's good to train all 
the stuff that got missed or wasn't autolearned. But you have to be sure 
it's learned for the "right side".

> 
> I was thinking to increase bayes_auto_learn_threshold_spam to a higher 
> number, so less spam is auto-learned. Is this ok?

This would be nonsense. In theory you want to learn *all* ham and *all* 
spam. As you obviously can't do this you learn *as much as possible*, 
within the constraints of your operation.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com

Re: bayes learn best practice

Posted by Arthur Kerpician <ar...@bluechip.ro>.

Kai Schaetzl wrote:
> Arthur Kerpician wrote on Thu, 09 Apr 2009 09:41:22 +0300:
>
>   
>> The docs mention that after 5000 spam and ham learned, 
>> spamassassin doesn't improve spam detection much.
>>     
>
> Do they? What is meant is that once you reach some threshold the detection 
> rate doesn't improve as much as before. You can't get any better than 
> "nearly everything". But it will drop if no new tokens get added.
>
>> What is the best 
>> practice to optimize the bayes detection? Should I stop auto-learning 
>> after reaching the 5000 mark and then re-train from time to time from 
>> scratch?
>>     
>
> No, keep the automatic training (unless there are too many FPs in the 
> autotrained messages). Do a regular manual expire, so old tokens are 
> purged out.
>   
I don't get many FPs or FNs after upgrading to 3.2.5 and retraining 
bayes. But, if I keep auto-learning enabled, I should monitor the 
trained spam and ham levels and manually train ham when the spam exceeds 
it (as it will always exceed ham level). So from time to time I should 
feed ham manually to sa-learn, until it reaches the spam level again. Is 
this correct? If it is, I think it's rather time-consuming to always 
check the trained ham/spam and level them.

I was thinking to increase bayes_auto_learn_threshold_spam to a higher 
number, so less spam is auto-learned. Is this ok?

Re: bayes learn best practice

Posted by Kai Schaetzl <ma...@conactive.com>.

Arthur Kerpician wrote on Thu, 09 Apr 2009 09:41:22 +0300:

> The docs mention that after 5000 spam and ham learned, 
> spamassassin doesn't improve spam detection much.

Do they? What is meant is that once you reach some threshold the detection 
rate doesn't improve as much as before. You can't get any better than 
"nearly everything". But it will drop if no new tokens get added.

> What is the best 
> practice to optimize the bayes detection? Should I stop auto-learning 
> after reaching the 5000 mark and then re-train from time to time from 
> scratch?

No, keep the automatic training (unless there are too many FPs in the 
autotrained messages). Do a regular manual expire, so old tokens are 
purged out.
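
A regular manual expire could be scripted along these lines. This is only a sketch: it assumes sa-learn is on the PATH and that the script runs as the user who owns the bayes database. The function name force_expire and the injectable run parameter are hypothetical conveniences; --force-expire is the sa-learn option that triggers an expiry pass.

```python
import subprocess


def force_expire(run=subprocess.run):
    """Ask sa-learn to run a token expiry pass now."""
    command = ["sa-learn", "--force-expire"]
    # check=True raises CalledProcessError if sa-learn exits non-zero.
    return run(command, check=True)
```

Calling force_expire() from a nightly cron job would be one way to make the expire "regular"; the run parameter exists only so the command can be inspected without actually invoking sa-learn.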

Kai


Re: bayes learn best practice

Posted by John Hardin <jh...@impsec.org>.

On Thu, 9 Apr 2009, Arthur Kerpician wrote:

> I tried to manually keep both spam and ham at the same level in 
> the bayes db but it seems that spamassassin is learning spam twice as 
> fast as ham.

Not surprising, as raw email traffic has a very skewed spam:ham ratio. 
Surely you've heard the stats that "90% of all email is spam"?
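
For anyone who wants to watch the learned spam:ham ratio rather than guess at it, the counters can be read from the output of `sa-learn --dump magic`. The sketch below parses that output; it assumes the usual layout in which each "non-token data" line carries its value in the third column. The sample text is made up to match the 6000/3000 counts from the original post, and the function names are hypothetical.

```python
def parse_dump_magic(text):
    """Return the 'non-token data' counters from `sa-learn --dump magic` output."""
    marker = "non-token data:"
    counters = {}
    for line in text.splitlines():
        if marker not in line:
            continue
        fields, name = line.split(marker, 1)
        columns = fields.split()
        # Assumed layout: probability, unused, value, unused.
        counters[name.strip()] = int(columns[2])
    return counters


def spam_to_ham_ratio(counters):
    """Learned spam per learned ham, or None if no ham has been learned."""
    if counters.get("nham", 0) == 0:
        return None
    return counters["nspam"] / counters["nham"]


# Made-up sample shaped like the counts in the original post.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       6000          0  non-token data: nspam
0.000          0       3000          0  non-token data: nham
"""

counters = parse_dump_magic(SAMPLE)
print(counters["nspam"], counters["nham"], spam_to_ham_ratio(counters))
```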

> The docs mention that after 5000 spam and ham learned, spamassassin 
> doesn't improve spam detection much. What is the best practice to 
> optimize the bayes detection? Should I stop auto-learning after reaching 
> the 5000 mark and then re-train from time to time from scratch?

I'll let others comment on issues like disk space and scan time w/r/t 
bayes database size. For myself, I have a _very_ small userbase and do 
purely manual training with a small corpus. I have under 3000 tokens 
total and get good results.

Build good representative ham and spam corpora, and train any misses (FPs 
and FNs) going forward. Retain those messages. Unfortunately autolearn 
doesn't let you retain those messages.

Retraining from scratch is only really necessary if things have gone 
completely out of whack, and at that point you review your corpora carefully 
for misclassified messages, wipe and retrain. Bayes should only go bonkers 
if you have people manually training messages incorrectly, or (not too 
likely) if autolearn has taken a slightly-poor configuration and magnified 
the errors.
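
The wipe-and-retrain step might look like the following sketch. It assumes the retained corpora are kept as two directories of message files, one for spam and one for ham; the directory arguments and function names are hypothetical. The sa-learn options used (--clear, --spam, --ham) are the standard ones, and --clear discards the existing database, so the corpora should be reviewed first.

```python
import subprocess


def retrain_commands(spam_dir, ham_dir):
    """Build the sa-learn commands for a wipe followed by a full retrain."""
    return [
        ["sa-learn", "--clear"],           # wipe the existing bayes database
        ["sa-learn", "--spam", spam_dir],  # learn every message in spam_dir as spam
        ["sa-learn", "--ham", ham_dir],    # learn every message in ham_dir as ham
    ]


def retrain(spam_dir, ham_dir, run=subprocess.run):
    """Run the commands in order, stopping at the first failure."""
    for command in retrain_commands(spam_dir, ham_dir):
        run(command, check=True)
```

For day-to-day corrections, only the --spam or --ham command is needed, pointed at the newly collected FNs or FPs.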
