Posted to users@spamassassin.apache.org by sinnerman <kr...@excite.com> on 2007/10/18 01:27:52 UTC

help with training bayesian filter

Hi All,

I currently have SpamAssassin set up on my FreeBSD machine and have trained
it with spam and ham messages (greater than the min thresholds of 200/200).
However, I'm not sure it's set up correctly, nor do I see any obvious results
(reduced spam) of the training process. A couple of questions:

* I'm running sa-learn from my own cron (i.e., as my login user), but I'm
running spamd as "nobody". Since (I believe) the bayesian database is being
created in my home directory, will spamd be able to access it, or will it
instead try to access another database? Spamc is also being run from my own
account. If this setup is not correct, how can I fix it?

* I cannot get autolearn to work. I've set "bayes_auto_learn_threshold_spam"
to 8, but even with messages which receive a score greater than 8, the
message's X-Spam-Status header still says "autolearn=no". Do I need to
enable autolearning in some other way?

Thanks.
-- 
View this message in context: http://www.nabble.com/help-with-training-bayesian-filter-tf4643977.html#a13265066
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.

some of us never use bayes

Posted by ji...@jidanni.org.

LW> At that point Bayes should kick in.  Now you get to the hard part.
LW> You need to watch Bayes like a hawk for a few weeks to make sure you

I have always had them turned off
use_auto_whitelist 0
use_bayes 0
and seem to be doing fine.
A couple spams slip through each day out of 100.

The key to comfort is when reviewing the spam directory to see if we
accidentally caught a friend's mail, be sure to view it in order of
increasing spam score. http://jidanni.org/comp/spam/spamdealer.html

Re: help with training bayesian filter

Posted by Loren Wilton <lw...@earthlink.net>.

I think the first things I'd do would be to make some adjustments to the 
settings:

> bayes_auto_learn_threshold_nonspam  0.2
> bayes_min_ham_num 200

And probably leave the rest the same.
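
Applied to the config posted earlier in the thread, those two changes would give something like the following. This is only a sketch: the other three lines are copied unchanged from that config, and the comments are added here for orientation.

```
required_hits   4
# lowered from 1
bayes_auto_learn_threshold_nonspam  0.2
bayes_auto_learn_threshold_spam     8
# raised from 100
bayes_min_ham_num 200
score BAYES_99 5
```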

Then I'd train on 200 hams, which you can go back into history to get; your 
ham messages probably don't change much year to year.

Also train at least 200 spams, which should be easy.  In this case though 
you want recent junk, not something from 6 months ago.  If it takes 6 days 
until you have enough spam to trigger bayes, that's fine, just wait for it.

At that point Bayes should kick in.  Now you get to the hard part.  You need 
to watch Bayes like a hawk for a few weeks to make sure you really got it 
trained right!  If you do this, and feed it corrections when you don't like 
how it scored a ham or spam, you will be fine.  If you *don't* do this, you 
will probably end up with Bayes going odd on a tangent, and you may end up 
with a database that is so badly trashed you will have to throw it away and 
start over.

But this watching closely business and feeding in corrections to get things 
right should only take a few weeks at most, unless the kind of mail you get 
changes.  I've had bayes running for years on the same database, and quite 
honestly I haven't had to train a message in probably a year now.  I also 
don't run auto-learning, and it is still giving me bayes_99 on my spams and 
numbers around 0 to 10 on my hams.  I guess that means my message types 
don't change much.  ;-)

        Loren

----- Original Message ----- 
From: "sinnerman" <kr...@excite.com>
To: <us...@spamassassin.apache.org>
Sent: Wednesday, October 17, 2007 8:49 PM
Subject: Re: help with training bayesian filter

>
> I'm running spamd as:
>
> spamd -d -l -u nobody --siteconfigpath=<my site config's path>
>
> My config file is:
>
> required_hits   4
> bayes_auto_learn_threshold_nonspam  1
> bayes_auto_learn_threshold_spam     8
> bayes_min_ham_num 100
> score BAYES_99 5
>
> I don't have bayes_auto_learn set explicitly, but the docs indicate that
> enabled is the default setting.
>
>
> Mr. Gus wrote:
>>
>> I have a systemwide config so I don't know from experience, but are you
>> running spamd with -x or setting the user with -u? Because if you are, that
>> might be mucking you up.
>>
>> Do you have bayes_auto_learn set? That's what turns it on/off.
>>
>> -- 
>> Gus
>>
>>
>

Re: help with training bayesian filter

Posted by Matt Kettler <mk...@verizon.net>.

sinnerman wrote:
> I'm running spamd as:
>
> spamd -d -l -u nobody --siteconfigpath=<my site config's path>
>   
Is there a particular reason why you're using the --siteconfigpath?

The reason I ask is that nearly everyone I've seen use this option misuses
it. The only time you should want to use this option is if you need to
have multiple different site configurations. Otherwise you're just
over-specifying things that SA will in general do a better job of
figuring out on its own.

In particular this option should NOT point to a directory that contains
50_scores.cf. That's the "default rules" directory, not the "site config
directory".

By default, the site config is either /etc/mail/spamassassin or
/etc/spamassassin (SA will search this and other similar options, and
pick the first one it finds). It should contain your *.pre files, your
local.cf (if you have one), and .cf files for any add-on rulesets you
choose to manually add.

Re: help with training bayesian filter

Posted by sinnerman <kr...@excite.com>.

I'm running spamd as:

spamd -d -l -u nobody --siteconfigpath=<my site config's path>

My config file is:

required_hits   4
bayes_auto_learn_threshold_nonspam  1
bayes_auto_learn_threshold_spam     8
bayes_min_ham_num 100
score BAYES_99 5

I don't have bayes_auto_learn set explicitly, but the docs indicate that
enabled is the default setting.


Mr. Gus wrote:
> 
> I have a systemwide config so I don't know from experience, but are you
> running spamd with -x or setting the user with -u? Because if you are, that
> might be mucking you up.
> 
> Do you have bayes_auto_learn set? That's what turns it on/off.
> 
> -- 
> Gus
> 
> 


Re: help with training bayesian filter

Posted by "Mr. Gus" <mr...@disco-zombie.net>.

On Wed, Oct 17, 2007 at 04:27:52PM -0700, sinnerman wrote:
> 
> Hi All,
> 
> I currently have SpamAssassin set up on my FreeBSD machine and have trained
> it with spam and ham messages (greater than the min thresholds of 200/200).
> However, I'm not sure it's set up correctly, nor do I see any obvious results
> (reduced spam) of the training process. A couple of questions:
> 
> * I'm running sa-learn from my own cron (i.e., as my login user), but I'm
> running spamd as "nobody". Since (I believe) the bayesian database is being
> created in my home directory, will spamd be able to access it, or will it
> instead try to access another database? Spamc is also being run from my own
> account. If this setup is not correct, how can I fix it?

I have a systemwide config so I don't know from experience, but are you
running spamd with -x or setting the user with -u? Because if you are, that
might be mucking you up.

> * I cannot get autolearn to work. I've set "bayes_auto_learn_threshold_spam"
> to 8, but even with messages which receive a score greater than 8, the
> message's X-Spam-Status header still says "autolearn=no". Do I need to
> enable autolearning in some other way?

Do you have bayes_auto_learn set? That's what turns it on/off.

-- 
Gus

Re: help with training bayesian filter

Posted by Matt Kettler <mk...@verizon.net>.

sinnerman wrote:
> I think I've solved the issues:
>
> * I've stopped using spamc/spamd, and now just use spamassassin (running as
> my logged in user, just like sa-learn). I think that has solved the issue of
> which bayesian database is being used.
>   
Well, your spamd startup was forcing everything to '-u nobody'. Unless
you're using SQL, that's going to break bayes.

The "nobody" user does not, and SHOULD NOT, have a writable home
directory, so any ordinary db_file-based database stored in the home
directory (which is SA's default) is going to fail to be written.
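
If spamd has to keep running as an unprivileged user, one workaround is to point Bayes at a shared location instead of the home directory. A minimal sketch for local.cf, assuming a hypothetical directory /var/spamassassin/bayes that both spamd's user and the account running sa-learn can write to (the path and the mode are illustrative, not taken from this thread):

```
# Base path and filename prefix for the Bayes database files
# (hypothetical location; the default lives under ~/.spamassassin).
bayes_path       /var/spamassassin/bayes/bayes
# Permissions for files SA creates there; 0770 assumes both accounts
# share a group.
bayes_file_mode  0770
```

sa-learn then has to read the same configuration, so that training and scanning use the same database.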

> * I had to explicitly load the plugin in my config file (I thought it was a
> standard plugin):
>
> loadplugin     Mail::SpamAssassin::Plugin::AutoLearnThreshold
>   
Define "config file". That should be loaded in your v310.pre file by
default. If your *.pre files are missing from /etc/mail/spamassassin,
you've got larger problems to look into.

> Also, I've found some other help topics on this, and now understand that
> autolearn doesn't rely just on the computed score. 
>

Re: help with training bayesian filter

Posted by sinnerman <kr...@excite.com>.

I think I've solved the issues:

* I've stopped using spamc/spamd, and now just use spamassassin (running as
my logged in user, just like sa-learn). I think that has solved the issue of
which bayesian database is being used.

* I had to explicitly load the plugin in my config file (I thought it was a
standard plugin):

loadplugin     Mail::SpamAssassin::Plugin::AutoLearnThreshold

Also, I've found some other help topics on this, and now understand that
autolearn doesn't rely just on the computed score. 
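
A rough sketch of the relevant settings, with the extra conditions noted as comments. The threshold values are the ones posted earlier in this thread; the comments summarize the AutoLearnThreshold plugin documentation as I understand it, so treat them as a pointer rather than a definitive description.

```
# Normally loaded from v310.pre rather than from a user config.
loadplugin     Mail::SpamAssassin::Plugin::AutoLearnThreshold

# Autolearning is on by default; shown here only to make it explicit.
bayes_auto_learn 1

bayes_auto_learn_threshold_nonspam  1
bayes_auto_learn_threshold_spam     8

# The score compared against these thresholds is not the one shown in
# X-Spam-Status: it is computed without the BAYES_* rules (and other rules
# flagged noautolearn), and learning as spam also needs at least 3 points
# from header rules and 3 points from body rules.
```

With "score BAYES_99 5" in place, a message can therefore total more than 8 and still fall short of the spam threshold for autolearning.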