Posted to users@spamassassin.apache.org by Arthur Kerpician <ar...@bluechip.ro> on 2009/04/09 08:41:22 UTC
bayes learn best practice
Hi,
I recently upgraded to 3.2.5 and re-trained bayes db from scratch. The
auto-learn is on so now I have about 6000 mails trained as spam and 3000
as ham. I tried to manually keep both spam and ham at the same level in
the bayes db but it seems that spamassassin is learning spam twice as
fast as ham. The docs mention that after 5000 spam and ham learned,
spamassassin doesn't improve spam detection much. What is the best
practice to optimize the bayes detection? Should I stop auto-learning
after reaching the 5000 mark and then re-train from time to time from
scratch?
Thanks,
Arthur
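
The learned spam and ham counts mentioned above can be read from the output of `sa-learn --dump magic`. Below is a minimal Python sketch that parses text in that layout and reports the spam:ham ratio. The sample text is made up to match the figures in the question (the ntokens value is invented), and the helper names `parse_magic` and `spam_ham_ratio` are hypothetical.

```python
import re

# Sample text in the layout printed by `sa-learn --dump magic`.
# The counts are illustrative only, not taken from a real database.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       6000          0  non-token data: nspam
0.000          0       3000          0  non-token data: nham
0.000          0     152340          0  non-token data: ntokens
"""

MAGIC_LINE = re.compile(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+)$")


def parse_magic(dump_text):
    """Return the 'non-token data' counters as a dict keyed by counter name."""
    counters = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            # The third column carries the value for these lines.
            counters[match.group(2).strip()] = int(match.group(1))
    return counters


def spam_ham_ratio(counters):
    """Return learned spam divided by learned ham, or None if no ham is learned."""
    nham = counters.get("nham", 0)
    if nham == 0:
        return None
    return counters.get("nspam", 0) / nham


counters = parse_magic(SAMPLE_DUMP)
print(counters["nspam"], counters["nham"], spam_ham_ratio(counters))
```

To use it against a live database, the sample text could be replaced by the captured output of `sa-learn --dump magic`, run as the user that owns the bayes db.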
Re: bayes learn best practice
Posted by Michael Scheidell <sc...@secnap.net>.
Arthur Kerpician wrote:
>
>
> I was thinking to increase bayes_auto_learn_threshold_spam to a higher
> number, so less spam is auto-learned. Is this ok?
I try to keep it at a 10 to 1 ratio (dropping the _ham threshold and
increasing the _spam threshold), basically, trying to mimic the global
stats of a 10 to 1 spam/ham ratio.
No real reason for that, just superstition I guess!
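
For reference, the two settings involved are `bayes_auto_learn_threshold_nonspam` (the "_ham" threshold above) and `bayes_auto_learn_threshold_spam`; as far as I know the stock values are 0.1 and 12.0. A minimal Python sketch that formats them as local.cf lines follows. The values passed in are purely illustrative, not a recommendation, and the helper name `autolearn_thresholds` is hypothetical.

```python
def autolearn_thresholds(nonspam, spam):
    """Format the two auto-learn threshold settings as local.cf lines."""
    if nonspam >= spam:
        raise ValueError("the nonspam threshold should sit below the spam threshold")
    return (
        f"bayes_auto_learn_threshold_nonspam {nonspam}\n"
        f"bayes_auto_learn_threshold_spam {spam}\n"
    )


# Illustrative values only: a lower nonspam threshold and a higher spam
# threshold both mean fewer messages qualify for auto-learning.
print(autolearn_thresholds(-0.5, 15.0), end="")
```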
--
Michael Scheidell, CTO
Re: bayes learn best practice
Posted by John Hardin <jh...@impsec.org>.
On Tue, 14 Apr 2009, Arthur Kerpician wrote:
> Again, if I choose to learn *all* spam and *all* ham, I'll end up with
> big differences between their levels in bayes, which will affect spam
> detection.
Not really, when you consider that the volume of spam you receive far
exceeds the volume of ham you receive.
> Anyway, in the meantime I stopped auto-learning and I'm manually
> feeding missed spam. For every spam message fed, I train at least 1 ham.
> So, this should keep the bayes db optimized.
While that won't hurt, I think you're worrying too much about it. If it
bothers you that much, turn off autolearning and concentrate on training
FPs and FNs.
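
A minimal Python sketch of that workflow, assuming autolearning is already disabled (`bayes_auto_learn 0` in local.cf) and that misclassified mail is collected in two directories. The directory paths and the helper name `correction_commands` are hypothetical; `--spam` and `--ham` are the standard sa-learn training switches.

```python
import shlex


def correction_commands(fn_dir, fp_dir):
    """Build sa-learn argument lists for missed spam (FNs) and flagged ham (FPs)."""
    return [
        ["sa-learn", "--spam", fn_dir],  # false negatives: spam that got through
        ["sa-learn", "--ham", fp_dir],   # false positives: ham that was flagged
    ]


# Hypothetical directories holding one message per file.
for argv in correction_commands("/var/spool/corrections/missed-spam",
                                "/var/spool/corrections/false-positives"):
    print(shlex.join(argv))
```

The sketch only prints the commands. Passing each list to `subprocess.run(argv, check=True)` would execute them.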
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
Re: bayes learn best practice
Posted by Arthur Kerpician <ar...@bluechip.ro>.
Kai Schaetzl wrote:
> Arthur Kerpician wrote on Thu, 09 Apr 2009 20:25:42 +0300:
>
>
>> . So from time to time I should
>> feed ham manually to sa-learn, until it reaches the spam level again. Is
>> this correct? If it is, I think it's rather time-consuming to always
>> check the trained ham/spam and level them.
>>
>
> There is no reason to do this and nobody told you to do so :-)
> Wherever you read this you either misunderstood it or it was wrong.
> You can manually train ham *and* spam if you like. It's good to train all
> the stuff that got missed or wasn't autolearned. But you have to be sure
> it's learned for the "right side".
>
I was talking about the fact that, over time, the amounts of spam and ham
auto-learned in bayes are going to be very different. In practice, for
every ham auto-learned there are 5-6 spams auto-learned. So, over time,
learned spam will be 5-6 times the learned ham. As the manual explains,
such big differences between the spam / ham levels trained for bayes
will be a huge drawback in spam detection. This was the context in which
I asked how I should keep both spam / ham levels even. And my own
answer was to manually feed bayes with ham until it reaches the
learned spam level. If I keep auto-learning running, the spam tokens
will outnumber the ham tokens.
>
>> I was thinking to increase bayes_auto_learn_threshold_spam to a higher
>> number, so less spam is auto-learned. Is this ok?
>>
>
> This would be nonsense. In theory you want to learn *all* ham and *all*
> spam. As you obviously can't do this you learn *as much as possible*,
> within the constraints of your operation.
>
Again, if I choose to learn *all* spam and *all* ham, I'll end up with
big differences between their levels in bayes, which will affect spam
detection.
Anyway, in the meantime I stopped auto-learning and I'm manually
feeding missed spam. For every spam message fed, I train at least 1 ham.
So, this should keep the bayes db optimized.
Re: bayes learn best practice
Posted by Kai Schaetzl <ma...@conactive.com>.
Arthur Kerpician wrote on Thu, 09 Apr 2009 20:25:42 +0300:
> . So from time to time I should
> feed ham manually to sa-learn, until it reaches the spam level again. Is
> this correct? If it is, I think it's rather time-consuming to always
> check the trained ham/spam and level them.
There is no reason to do this and nobody told you to do so :-)
Wherever you read this you either misunderstood it or it was wrong.
You can manually train ham *and* spam if you like. It's good to train all
the stuff that got missed or wasn't autolearned. But you have to be sure
it's learned for the "right side".
>
> I was thinking to increase bayes_auto_learn_threshold_spam to a higher
> number, so less spam is auto-learned. Is this ok?
This would be nonsense. In theory you want to learn *all* ham and *all*
spam. As you obviously can't do this you learn *as much as possible*,
within the constraints of your operation.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
Re: bayes learn best practice
Posted by Arthur Kerpician <ar...@bluechip.ro>.
Kai Schaetzl wrote:
> Arthur Kerpician wrote on Thu, 09 Apr 2009 09:41:22 +0300:
>
>
>> The docs mention that after 5000 spam and ham learned,
>> spamassassin doesn't improve spam detection much.
>>
>
> do they? What is meant is that once you reach some threshold the detection
> rate doesn't improve as much as before. You can't get any better than
> "nearly everything". But it will drop if no new tokens get added.
>
>> What is the best
>> practice to optimize the bayes detection? Should I stop auto-learning
>> after reaching the 5000 mark and then re-train from time to time from
>> scratch?
>>
>
> No, keep the automatic training (unless there are too many FPs in the
> autotrained messages). Do a regular manual expire, so old tokens are
> purged out.
>
I don't get many FPs or FNs after upgrading to 3.2.5 and retraining
bayes. But, if I keep auto-learning enabled, I should monitor the
trained spam and ham levels and manually train ham when the spam exceeds
it (as it will always exceed ham level). So from time to time I should
feed ham manually to sa-learn, until it reaches the spam level again. Is
this correct? If it is, I think it's rather time-consuming to always
check the trained ham/spam and level them.
I was thinking to increase bayes_auto_learn_threshold_spam to a higher
number, so less spam is auto-learned. Is this ok?
Re: bayes learn best practice
Posted by Kai Schaetzl <ma...@conactive.com>.
Arthur Kerpician wrote on Thu, 09 Apr 2009 09:41:22 +0300:
> The docs mention that after 5000 spam and ham learned,
> spamassassin doesn't improve spam detection much.
do they? What is meant is that once you reach some threshold the detection
rate doesn't improve as much as before. You can't get any better than
"nearly everything". But it will drop if no new tokens get added.
> What is the best
> practice to optimize the bayes detection? Should I stop auto-learning
> after reaching the 5000 mark and then re-train from time to time from
> scratch?
No, keep the automatic training (unless there are too many FPs in the
autotrained messages). Do a regular manual expire, so old tokens are
purged out.
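
A minimal Python sketch of a manual expire run, suitable for calling from a scheduled job. `sa-learn --force-expire` is the standard switch for this. The `dry_run` default and the helper name `run_expire` are choices made for this sketch. As far as I know, setting `bayes_auto_expire 0` in local.cf is the usual companion setting if expiry should happen only from such a job.

```python
import subprocess

EXPIRE_CMD = ["sa-learn", "--force-expire"]


def run_expire(dry_run=True):
    """Force a Bayes expiry run, or only report the command when dry_run is set."""
    if dry_run:
        return "would run: " + " ".join(EXPIRE_CMD)
    # Raises CalledProcessError if sa-learn exits with a non-zero status.
    result = subprocess.run(EXPIRE_CMD, capture_output=True, text=True, check=True)
    return result.stdout


print(run_expire())
```

Calling `run_expire(dry_run=False)` as the user that owns the bayes db would perform the actual expiry.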
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
Re: bayes learn best practice
Posted by John Hardin <jh...@impsec.org>.
On Thu, 9 Apr 2009, Arthur Kerpician wrote:
> I tried to manually keep both spam and ham at the same level in
> the bayes db but it seems that spamassassin is learning spam twice as
> fast as ham.
Not surprising, as raw email traffic has a very skewed spam:ham ratio.
Surely you've heard the stats that "90% of all email is spam"?
> The docs mention that after 5000 spam and ham learned, spamassassin
> doesn't improve spam detection much. What is the best practice to
> optimize the bayes detection? Should I stop auto-learning after reaching
> the 5000 mark and then re-train from time to time from scratch?
I'll let others comment on issues like disk space and scan time w/r/t
bayes database size. For myself, I have a _very_ small userbase and do
purely manual training with a small corpus. I have under 3000 tokens
total and get good results.
Build good representative ham and spam corpora, and train any misses (FPs
and FNs) going forward. Retain those messages. Unfortunately autolearn
doesn't let you retain those messages.
Retraining from scratch is only really necessary if things have gone
completely out of whack, and at that point you review your corpora carefully
for misclassified messages, wipe and retrain. Bayes should only go bonkers
if you have people manually training messages incorrectly, or (not too
likely) if autolearn has taken a slightly-poor configuration and magnified
the errors.
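
The wipe-and-retrain sequence could be sketched in Python as below, assuming the reviewed corpora live in two directories. The paths and the helper name `retrain_plan` are hypothetical; `--clear`, `--ham`, `--spam` and `--dump magic` are standard sa-learn switches.

```python
def retrain_plan(ham_dir, spam_dir):
    """Return the sa-learn invocations for a full wipe and retrain, in order."""
    return [
        ["sa-learn", "--clear"],           # wipe the existing bayes database
        ["sa-learn", "--ham", ham_dir],    # relearn the reviewed ham corpus
        ["sa-learn", "--spam", spam_dir],  # relearn the reviewed spam corpus
        ["sa-learn", "--dump", "magic"],   # show the resulting nspam/nham counts
    ]


# Hypothetical corpus locations; review them for misclassified mail first.
for argv in retrain_plan("/home/sa/corpus/ham", "/home/sa/corpus/spam"):
    print(" ".join(argv))
```

The sketch only prints the plan, since `--clear` is destructive. Each list can be handed to `subprocess.run(argv, check=True)` once the corpora have been checked.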
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/