Posted to users@spamassassin.apache.org by mizzio <mi...@sinapto.net> on 2005/05/23 13:07:28 UTC

bayes training question

Hi everybody,

first post on this list so please be patient if I'm asking dumb
questions. :-)

I have a SpamAssassin 3.0.3 mail gateway which is working pretty well,
and I train the Bayesian DB every day.
I have a couple of questions on this:

- Some messages coming from this mailing list get marked as SPAM,
since the body contains URLs and text from real spam messages: should I
feed them into my DB as ham, or can this cause some kind of Bayes
poisoning?

- I assume that the training is more important for the messages marked
with BAYES_50 BODY: Bayesian spam probability is 40 to 60% [score:
0.5998]; is this correct ?

- Should I also train as ham the messages not marked as SPAM but having a
score somewhere between 1/2 and 3/4? I mean, does feeding "normal" messages
into the system also help to get good Bayes filtering?

- Is the opposite also true? Can feeding already-marked messages
reinforce the Bayesian filtering?


Thank you to everybody for your time and attention,
mizzio

Re: bayes training question

Posted by Roman Volf <vo...@keystreams.com>.

Jim Maul wrote:

>
>
> It fixes my problem of list messages being autolearned incorrectly,
> but I'd rather not scan them at all.  Someone made a suggestion (and
> patch) on the qmail-scanner mailing list where you can optionally turn
> SA scanning off using tcp.smtp from certain IPs.  I may use this to
> not pass messages coming from the apache mail server through SA.  You
> may want to check that list out as well.
>
> -Jim

I should have read your whole email before replying, but anyway here is 
the link to the patch I posted to the qmail-scanner list:

http://www.thevolf.com/qmail/qmail-scanner-skip-sa.patch


-- 
Roman Volf
Keystreams Internet Solutions
volfman@keystreams.com

Re: bayes training question

Posted by Roman Volf <vo...@keystreams.com>.

> Until I can come up with a way to not scan some emails selectively
> using qmail-scanner (without procmail) I have settled on using the
> following statements in my local.cf
>
> bayes_ignore_to users@spamassassin.apache.org
> whitelist_to users@spamassassin.apache.org
>
> This causes (most) list messages to not be marked as spam and all list
> messages are ignored by bayes.
>
> It fixes my problem of list messages being autolearned incorrectly,
> but I'd rather not scan them at all.  Someone made a suggestion (and
> patch) on the qmail-scanner mailing list where you can optionally turn
> SA scanning off using tcp.smtp from certain IPs.  I may use this to
> not pass messages coming from the apache mail server through SA.  You
> may want to check that list out as well.
>
> -Jim

Jim,

You can use a patch I wrote yesterday for qmail-scanner-queue.pl:
http://www.thevolf.com/qmail/qmail-scanner-skip-sa.patch

Then add:

209.237.227.199:allow,IGNORE_SA="yes"

to your tcp.smtp and rebuild the tcp.smtpd.cdb file.

This will cause qmail-scanner to skip SA tests for email originating 
from the spamassassin mailing list (hermes.apache.org).
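
For anyone unsure how the tcp.smtp line above is put together, here is a
minimal Python sketch that splits a rule of that shape into its address,
action and environment variables. The function name parse_tcprule is made up
for illustration, and it only handles the simple address:action,VAR="value"
form shown above; it is not how tcpserver or qmail-scanner actually read the
file.

```python
def parse_tcprule(line):
    """Split a simple tcp.smtp rule into (address, action, env dict).

    Hypothetical helper for illustration only: it handles the
    address:action,VAR="value",... form and assumes values contain no commas.
    """
    address, _, instructions = line.strip().partition(":")
    parts = instructions.split(",")
    action = parts[0]
    env = {}
    for item in parts[1:]:
        name, _, value = item.partition("=")
        env[name] = value.strip('"')
    return address, action, env


# The rule from this message: allow the list server and set IGNORE_SA.
address, action, env = parse_tcprule('209.237.227.199:allow,IGNORE_SA="yes"')
print(address, action, env)
```

The point is only that everything after "allow," is an environment variable
handed to the SMTP session, which is what the patched qmail-scanner checks.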

-- 
Roman Volf
Keystreams Internet Solutions
volfman@keystreams.com

Re: bayes training question

Posted by mizzio <mi...@sinapto.net>.

Very nice solution for my needs.

Thank you !
maurizio

On Tue, 24-05-2005 at 13:08 -0400, Jim Maul wrote:
> mizzio wrote:
> > Hello guys,
> > sorry to bother you again: I didn't find a way to exclude this mailing
> > list from SA scanning in my setup.
> > I'm using qmail + qmail-scanner + spamassassin on my mailserver, the
> > only posts I found are about excluding the scanning with procmail (which
> > I'm not using).
> > 
> > I did not find a way of doing this through qmail-scanner: is there a way
> > of doing this directly with spamassassin ?
> > 
> > Any idea is greatly appreciated.
> > 
> > Thank you and regards,
> > Maurizio
> > 
> >
> 
> Until I can come up with a way to not scan some emails selectively using
> qmail-scanner (without procmail) I have settled on using the following
> statements in my local.cf
>
> bayes_ignore_to users@spamassassin.apache.org
> whitelist_to users@spamassassin.apache.org
>
> This causes (most) list messages to not be marked as spam and all list
> messages are ignored by bayes.
>
> It fixes my problem of list messages being autolearned incorrectly, but
> I'd rather not scan them at all.  Someone made a suggestion (and patch)
> on the qmail-scanner mailing list where you can optionally turn SA
> scanning off using tcp.smtp from certain IPs.  I may use this to not
> pass messages coming from the apache mail server through SA.  You may
> want to check that list out as well.
> 
> -Jim
>

Re: bayes training question

Posted by Jim Maul <jm...@elih.org>.

mizzio wrote:
> Hello guys,
> sorry to bother you again: I didn't find a way to exclude this mailing
> list from SA scanning in my setup.
> I'm using qmail + qmail-scanner + spamassassin on my mailserver, the
> only posts I found are about excluding the scanning with procmail (which
> I'm not using).
> 
> I did not find a way of doing this through qmail-scanner: is there a way
> of doing this directly with spamassassin ?
> 
> Any idea is greatly appreciated.
> 
> Thank you and regards,
> Maurizio
> 
>

Until I can come up with a way to not scan some emails selectively using
qmail-scanner (without procmail) I have settled on using the following
statements in my local.cf

bayes_ignore_to users@spamassassin.apache.org
whitelist_to users@spamassassin.apache.org

This causes (most) list messages to not be marked as spam and all list
messages are ignored by bayes.

It fixes my problem of list messages being autolearned incorrectly, but
I'd rather not scan them at all.  Someone made a suggestion (and patch)
on the qmail-scanner mailing list where you can optionally turn SA
scanning off using tcp.smtp from certain IPs.  I may use this to not
pass messages coming from the apache mail server through SA.  You may
want to check that list out as well.

-Jim

Re: bayes training question

Posted by jdow <jd...@earthlink.net>.

He's on a machine I administer. I have a "thing" about autolearn.
I don't do it. So this is not a problem for Loren. (It's too big
a pain to repair a self-mis-trained bayes database. So the neat
selectively trained bayes databases we have work quite nicely as
a result.) I cannot see the long term utility of autolearn whereas
I can see its long term futility.

{^_-}

From: "mizzio" <mi...@sinapto.net>

> Loren,
> 
> it works:
> 
> X-Spam-Status: No, hits=-56.2 required=4.5
> X-Spam-Report: SA TESTS -100 USER_IN_WHITELIST_SA   SA List 2.4 BAYES_50
> BODY: Bayesian spam probability is 40 to 60% [score: 0.4439]
> 
> 
> One more question: I understand that this way the mails are never
> marked as spam, but they are still autolearned by the system.
> Is this correct?
> 
> Thank you,
> Maurizio
> 
> 
> On Tue, 24-05-2005 at 05:21 -0700, Loren Wilton wrote:
> > header  WHITELIST_SA   List-Id =~ /(?:dev|users)\.spamassassin\.apache\.org/i
> > describe WHITELIST_SA   SA List
> > score  WHITELIST_SA   -100

Re: bayes training question

Posted by mizzio <mi...@sinapto.net>.

Loren,

it works:

X-Spam-Status: No, hits=-56.2 required=4.5
X-Spam-Report: SA TESTS -100 USER_IN_WHITELIST_SA   SA List 2.4 BAYES_50
BODY: Bayesian spam probability is 40 to 60% [score: 0.4439]


One more question: I understand that this way the mails are never
marked as spam, but they are still autolearned by the system.
Is this correct?

Thank you,
Maurizio


On Tue, 24-05-2005 at 05:21 -0700, Loren Wilton wrote:
> header  WHITELIST_SA   List-Id =~ /(?:dev|users)\.spamassassin\.apache\.org/i
> describe WHITELIST_SA   SA List
> score  WHITELIST_SA   -100

Re: bayes training question

Posted by Loren Wilton <lw...@earthlink.net>.

> I did not find a way of doing this through qmail-scanner: is there a way
> of doing this directly with spamassassin ?

Possibly someone else knows of a way with qmail-scanner.  If not, you can't
"exclude" it with SA, but you *can* whitelist the list with SA.  That will
probably be sufficient.

This will do the trick for you:

header  WHITELIST_SA   List-Id =~ /(?:dev|users)\.spamassassin\.apache\.org/i
describe WHITELIST_SA   SA List
score  WHITELIST_SA   -100

Someone will doubtless point out that this test is forgeable, and potentially
will let real spam into your system.  I haven't had it happen yet.  But the
possibility exists.
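
To see what the pattern in that rule accepts, here is a rough Python sketch
that applies the same regex to a few List-Id values. The sample header values
and the helper name hits_whitelist_sa are made up for illustration; SA itself
evaluates the rule in Perl, so treat this only as a way to poke at the pattern.

```python
import re

# Same pattern as the List-Id rule above; the trailing /i becomes re.IGNORECASE.
LIST_ID_RE = re.compile(r"(?:dev|users)\.spamassassin\.apache\.org", re.IGNORECASE)


def hits_whitelist_sa(list_id_header):
    """Return True if a List-Id header value would match the rule's regex."""
    return LIST_ID_RE.search(list_id_header) is not None


# Illustrative header values, not copied from real messages.
print(hits_whitelist_sa("<users.spamassassin.apache.org>"))
print(hits_whitelist_sa("<announce.example.org>"))
# Anyone can write this header into a message, which is the forgery risk.
print(hits_whitelist_sa("Fake list <USERS.spamassassin.apache.org>"))
```

The last call is the forgery case: the rule only looks at the header text, so
a spammer who adds a matching List-Id gets the same -100.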

        Loren

Re: bayes training question

Posted by mizzio <mi...@sinapto.net>.

Hello guys,
sorry to bother you again: I didn't find a way to exclude this mailing
list from SA scanning in my setup.
I'm using qmail + qmail-scanner + spamassassin on my mailserver, the
only posts I found are about excluding the scanning with procmail (which
I'm not using).

I did not find a way of doing this through qmail-scanner: is there a way
of doing this directly with spamassassin ?

Any idea is greatly appreciated.

Thank you and regards,
Maurizio

> The best thing is to avoid having the mail from this list go through SA.
> There are various ways to do this, depending on your mail setup.

Re: bayes training question

Posted by mizzio <mi...@sinapto.net>.

Thank you very much, Loren.

regards,
mizzio

On Mon, 23-05-2005 at 04:51 -0700, Loren Wilton wrote:
> > - Some messages coming from this mailing list get marked as SPAM,
> > since the body contains URLs and text from real spam messages: should I
> > feed them into my DB as ham, or can this cause some kind of Bayes
> > poisoning?
> 
> The best thing is to avoid having the mail from this list go through SA.
> There are various ways to do this, depending on your mail setup.
> 
> 
> > - I assume that the training is more important for the messages marked
> > with BAYES_50 BODY: Bayesian spam probability is 40 to 60% [score:
> > 0.5998]; is this correct ?
> 
> Probably most important are cases where Bayes guessed wrong, rather than
> simply not being real sure.  Always train as ham or spam anything you see
> that Bayes decided to lean the other way.  This way it will get to know what
> is what for you.
> 
> Second most important would be training stuff that scores close to 50%.
> Personally I tend to dump most spam that scores less than about 80% into the
> spam training bucket.  Now and then I'll throw a handful of known ham in the
> ham bucket, to try to keep the number of learned ham/spam somewhat balanced.
> 
> 
> > - Should I also train as ham the messages not marked as SPAM but having a
> > score somewhere between 1/2 and 3/4? I mean, does feeding "normal" messages
> > into the system also help to get good Bayes filtering?
> 
> I'm not absolutely sure what you are saying here.  If you are asking if you
> should train known ham as ham, the answer is yes.  Bayes needs to be able to
> decide which tokens are ham and which are spam.  It can only do this if it
> sees both ham and spam.  If you have ham that is hitting more than 20 or 30%
> you should certainly train it as ham.  However, even throwing ham that
> scores near 0 into training every so often is a good idea.
> 
>         Loren
> 
>

changing bayes from individual to global

Posted by li...@zeta.net.

Hello,

I am currently running SA 3.03 with bayes, and bayes is storing all of its
data in a MySQL DB.  As this was all installed via PSoft's HSphere, I didn't
actually pick the specific configuration myself.  On my system, bayes is
trained per mailbox individually, and I want to change this so that
there is one global bayes database for the entire mail server.  Is this an
easy modification to make?  Any help would be appreciated.

Regards,
Devin

Re: bayes training question

Posted by Loren Wilton <lw...@earthlink.net>.

> - Some messages coming from this mailing list get marked as SPAM,
> since the body contains URLs and text from real spam messages: should I
> feed them into my DB as ham, or can this cause some kind of Bayes
> poisoning?

The best thing is to avoid having the mail from this list go through SA.
There are various ways to do this, depending on your mail setup.


> - I assume that the training is more important for the messages marked
> with BAYES_50 BODY: Bayesian spam probability is 40 to 60% [score:
> 0.5998]; is this correct ?

Probably most important are cases where Bayes guessed wrong, rather than
simply not being real sure.  Always train as ham or spam anything you see
that Bayes decided to lean the other way.  This way it will get to know what
is what for you.

Second most important would be training stuff that scores close to 50%.
Personally I tend to dump most spam that scores less than about 80% into the
spam training bucket.  Now and then I'll throw a handful of known ham in the
ham bucket, to try to keep the number of learned ham/spam somewhat balanced.


> - Should I also train as ham the messages not marked as SPAM but having a
> score somewhere between 1/2 and 3/4? I mean, does feeding "normal" messages
> into the system also help to get good Bayes filtering?

I'm not absolutely sure what you are saying here.  If you are asking if you
should train known ham as ham, the answer is yes.  Bayes needs to be able to
decide which tokens are ham and which are spam.  It can only do this if it
sees both ham and spam.  If you have ham that is hitting more than 20 or 30%
you should certainly train it as ham.  However, even throwing ham that
scores near 0 into training every so often is a good idea.
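
If it helps to see that ordering spelled out, here is a rough Python sketch of
how one might rank hand-sorted mail for training. The function name and the
exact cut-offs (0.4/0.6 for "leaned the wrong way", 0.8 for spam, 0.2 for ham)
are my own illustrative assumptions based loosely on the numbers above; nothing
in SA works this way internally.

```python
def training_priority(bayes_prob, is_spam):
    """Rank a hand-classified message for training; 1 means train first.

    bayes_prob is the Bayes score between 0 and 1 (e.g. 0.5998) and is_spam is
    your own verdict. All thresholds are illustrative assumptions.
    """
    # 1: Bayes leaned the wrong way (spam scored low, or ham scored high).
    if (is_spam and bayes_prob < 0.4) or (not is_spam and bayes_prob > 0.6):
        return 1
    # 2: Bayes was unsure, roughly the BAYES_50 range.
    if 0.4 <= bayes_prob <= 0.6:
        return 2
    # 3: right direction but weak: spam under 80%, ham over 20%.
    if (is_spam and bayes_prob < 0.8) or (not is_spam and bayes_prob > 0.2):
        return 3
    # 4: already confident; still worth an occasional handful for balance.
    return 4


samples = [
    ("ham scored 0.02", 0.02, False),
    ("spam scored 0.5998", 0.5998, True),
    ("ham scored 0.90", 0.90, False),
    ("spam scored 0.70", 0.70, True),
]
ordered = sorted(samples, key=lambda s: training_priority(s[1], s[2]))
for name, prob, is_spam in ordered:
    print(training_priority(prob, is_spam), name)
```

With those samples the misjudged ham comes first, then the 0.5998 spam, then
the weak spam, and the near-0 ham last.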

        Loren