Posted to users@spamassassin.apache.org by Mun Fai <mu...@viewqwest.com> on 2005/07/02 23:59:44 UTC

Bayesian filtering test

Hi all

I'm using SpamAssassin 3.0.1 on a qmail system, and invoking it using qmail-scanner.

I have just enabled Bayesian filtering, with the following configuration options:

use_bayes 1
bayes_store_module Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn DBI:mysql:spamassassin:xxx:xxx
bayes_sql_username xxx
bayes_sql_password xxx
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam 0.50
bayes_auto_learn_threshold_spam 12.00
bayes_min_ham_num 200
bayes_min_spam_num 200

I know for sure that autolearning is working, and running sa-learn on maildirs of ham and spam doesn't generate any errors.

Now, how do I ensure that the Bayesian classifier is actually being invoked to score new messages? I tried the following test:

1. Send a test mail to my account, and explicitly run sa-learn on the mail to identify it as spam. This is repeated about 15 times.
2. Run sa-learn on all my other emails and identify them as ham.
3. Reduce both bayes_min_ham_num and bayes_min_spam_num to 5, and restart spamd
4. Send the test mail again

The test mail at step 4 does not get classified as spam, and when I check the headers of the mail, no BAYES_* rules have been run on it.

Am I doing the right thing here?


Rgds
Lee

Re: Bayesian filtering test

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Mun,

Saturday, July 2, 2005, 2:59:44 PM, you wrote:

MF> I'm using SpamAssassin 3.0.1 on a qmail system, and invoking it
MF> using qmail-scanner. 

MF> 1. Send a test mail to my account, and explicitly run
MF> sa-learn on the mail to identify it as spam. This is repeated for
MF> abt 15 times.

Not necessary -- once you sa-learn it once, you get no benefit from
subsequent sa-learns.

MF> 3. Reduce both bayes_min_ham_num and bayes_min_spam_num to 5, and
MF> restarted spamd 
MF> 4. Send the test mail again

MF> The test mail at step 4 does not get classified as spam, and
MF> when I check the headers of the mail, no BAYES_* rules have been
MF> run on it.

MF> Am I doing the right thing here?

Possibly.

Take that same email, with full headers, as a text file on disk, and
while you are logged on as the same user that spamd uses when kicked
off by your qmail system, do
> spamassassin -D <thatemail.txt >thatemail.out 2>thatemail.err

The thatemail.err file should contain debugging output on your email,
including some lines like

[2676] dbg: bayes: tie-ing to DB file R/O /home/Bob/.spamassassin/bayes_toks
[2676] dbg: bayes: tie-ing to DB file R/O /home/Bob/.spamassassin/bayes_seen
[2676] dbg: bayes: found bayes db version 3
[2676] dbg: bayes: DB journal sync: last sync: 0
[2676] dbg: config: using "/home/Bob/.spamassassin" for user state dir
[2676] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200
[2676] dbg: bayes: untie-ing
[2676] dbg: bayes: untie-ing db_toks
[2676] dbg: bayes: untie-ing db_seen
[2676] dbg: config: score set 1 chosen.
[2676] dbg: bayes: tie-ing to DB file R/O /home/Bob/.spamassassin/bayes_toks
[2676] dbg: bayes: tie-ing to DB file R/O /home/Bob/.spamassassin/bayes_seen
[2676] dbg: bayes: found bayes db version 3
[2676] dbg: bayes: DB journal sync: last sync: 0
[2676] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200
[2676] dbg: bayes: untie-ing
[2676] dbg: bayes: untie-ing db_toks
[2676] dbg: bayes: untie-ing db_seen

You'll note my
[2676] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200

This is because on my test system I haven't trained Bayes.

On your system you should instead see other debug lines indicating what
Bayes is actually doing.
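
To make the debug output easier to scan, here is a minimal Python sketch that pulls the "dbg: bayes:" lines out of thatemail.err and reports any "not available for scanning" shortfall. The pattern is copied from the spam(s) line quoted above; the matching ham(s) wording, the function name bayes_debug_summary, and the sample text are assumptions for illustration, and other SpamAssassin versions may word these lines differently.

```python
import re

# Pattern based on the debug line quoted in this thread. The "ham" variant
# is an assumption; it is not shown in the thread.
NOT_AVAILABLE = re.compile(
    r"dbg: bayes: not available for scanning, "
    r"only (\d+) (spam|ham)\(s\) in bayes DB < (\d+)"
)


def bayes_debug_summary(debug_text):
    """Return (bayes_lines, shortfalls) from `spamassassin -D` stderr text.

    shortfalls is a list of (kind, learned, required) tuples, deduplicated
    because the same message can be logged more than once per run.
    """
    bayes_lines = [
        line for line in debug_text.splitlines() if "dbg: bayes:" in line
    ]
    shortfalls = []
    for line in bayes_lines:
        match = NOT_AVAILABLE.search(line)
        if match:
            item = (match.group(2), int(match.group(1)), int(match.group(3)))
            if item not in shortfalls:
                shortfalls.append(item)
    return bayes_lines, shortfalls


# Sample text built from lines quoted earlier in this message.
sample = """\
[2676] dbg: bayes: found bayes db version 3
[2676] dbg: config: score set 1 chosen.
[2676] dbg: bayes: not available for scanning, only 0 spam(s) in bayes DB < 200
[2676] dbg: bayes: untie-ing
"""

if __name__ == "__main__":
    lines, shortfalls = bayes_debug_summary(sample)
    print(len(lines), "bayes debug lines")
    for kind, learned, required in shortfalls:
        print(f"only {learned} {kind} learned, {required} required")
```

To use it on a real run, read thatemail.err into a string and pass it to bayes_debug_summary. An empty list of bayes lines would suggest Bayes is not being consulted at all for that user.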

One possibility: If you're doing sa-learn as one user, and qmail is
invoking spamd to run as a different user, you may be training one
Bayes database and spamd may be using a different database that isn't
trained.

Bob Menschel



MF> Rgds
MF> Lee 



-- 
Best regards,
 Robert                            mailto:Robert@Menschel.net

Re: Bayesian filtering test

Posted by Loren Wilton <lw...@earthlink.net>.

bayes_min_ham_num 200
bayes_min_spam_num 200

Bayes won't run until you have learned at least 200 each of spam and ham.
So you normally shouldn't expect it to work for the first day or so after
you turn it on.
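
As a rough illustration of that gate, the sketch below checks learned-message counts against the two minimums. The function name bayes_ready and the example counts are hypothetical. The comparison direction follows the "only 0 spam(s) in bayes DB < 200" debug message quoted earlier in the thread, which suggests Bayes is skipped while a count is below its minimum.

```python
def bayes_ready(nspam, nham, min_spam=200, min_ham=200):
    """Return True once both learned counts reach their configured minimums.

    Defaults mirror bayes_min_spam_num / bayes_min_ham_num from the
    original post. This is a sketch of the rule, not SpamAssassin's code.
    """
    return nspam >= min_spam and nham >= min_ham


if __name__ == "__main__":
    # Hypothetical counts: plenty of ham, but a single learned spam.
    print(bayes_ready(nspam=1, nham=500))                          # default 200/200
    print(bayes_ready(nspam=1, nham=500, min_spam=5, min_ham=5))   # lowered to 5/5
    print(bayes_ready(nspam=15, nham=500, min_spam=5, min_ham=5))
```

Note that lowering the minimums to 5 only helps if the database really holds at least 5 distinct learned spam messages; if repeated sa-learn runs on one mail count as a single message, the spam count could still be below the minimum.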

Once it is running, you should start seeing rule hits in your mails like
BAYES_00 (for ham) or BAYES_99 (for spam).  There are other rules, but they
all start with BAYES_??.
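
A quick way to check a delivered message for these hits is to search its headers for rule names of the form BAYES_ followed by two digits. This is only a sketch: the function name bayes_hits and the sample header line are made up for illustration, and the exact header your setup writes (qmail-scanner may add its own) is not assumed here.

```python
import re

# Matches rule names such as BAYES_00 or BAYES_99.
BAYES_RULE = re.compile(r"\bBAYES_\d{2}\b")


def bayes_hits(header_text):
    """Return the distinct BAYES_nn rule names found, in order of appearance."""
    seen = []
    for name in BAYES_RULE.findall(header_text):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    # Hypothetical header line, not taken from a real message.
    example = "X-Spam-Status: Yes, tests=BAYES_99,HTML_MESSAGE"
    print(bayes_hits(example))
```

An empty result for a message that was scanned means no Bayes rule fired, which matches the symptom described in the original post.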

I'm a little concerned with your bayes ham learning threshold; it seems a
little high to me.  You might consider taking it down to .2 or .1 or so.
Many people like having it near -.1.
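
To see why the ham threshold matters, here is a simplified sketch of a threshold decision using the values from the original post. It is an assumption-laden model, not SpamAssassin's implementation: the real autolearner applies further conditions and uses its own score calculation, and the exact boundary handling (strictly below for ham, at or above for spam) is assumed here. The name autolearn_decision is hypothetical.

```python
def autolearn_decision(score, nonspam_threshold=0.50, spam_threshold=12.00):
    """Return "ham", "spam", or None for a message score.

    Simplified model: learn as ham below the nonspam threshold, learn as
    spam at or above the spam threshold, otherwise learn nothing.
    Boundary handling is an assumption.
    """
    if score < nonspam_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None


if __name__ == "__main__":
    # A message scoring 0.3 is auto-learned as ham at 0.50 ...
    print(autolearn_decision(0.3))
    # ... but is left alone with the stricter 0.1 suggested above.
    print(autolearn_decision(0.3, nonspam_threshold=0.1))
```

The lower the ham threshold, the fewer borderline messages get learned as ham automatically, which reduces the chance of low-scoring spam polluting the ham side of the database.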

You can also run spamassassin -D <mail.msg and look at the debug output.  It
will tell you quite a lot about what Bayes is doing, if anything.

        Loren