Posted to users@spamassassin.apache.org by Chavdar Videff <ch...@mr-bricolage.bg> on 2005/05/30 17:00:50 UTC

false positives and negatives

Dear List,

I know these topics are covered by the FAQ and the documentation, yet after
reading all of it I still have no answer to the following questions:

1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0 
points and gets through (even if we set required hits to 3 and 2 for certain 
mailboxes).

2. Mail composed as HTML is rated as spam for the above reason.

What can we do to improve the situation and boost the performance of SA?

I assume that if we set required hits below 5.0, ham messages composed as HTML 
will be rated as spam. However, the overwhelming number of spam rated below 
4, 3, 2 and even 1 points that we receive renders spamassassin useless for 
our mail-server.

We sort ham and spam and run sa-learn daily in order to train SA. We feed the
low-rated spam and the misclassified ham to sa-learn without any
success: most messages (the ones that keep recurring) continue to go through.

Please help.

Why doesn't sa-learn help? We thought that if we submitted a misclassified
message to sa-learn, the next message that is the same or comes from the same
address would be sorted correctly.


Following is the configuration file (debian sid, sendmail, sitewide 
configuration of SA).

mail1:/home/chavdar# cat /etc/mail/spamassassin/local.cf
# This is the right place to customize your installation of SpamAssassin.
#
# See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
# tweaked.
#
###########################################################################
#
# rewrite_header Subject *****SPAM*****
# report_safe 1
# trusted_networks 10.50
# lock_method flock

required_hits 3
rewrite_subject 1
report_header 1
use_terse_report 1
defang_mime 0
report_safe 0
use_bayes 1
auto_learn 1

Regards

Chavdar Videff

Re: false positives and negatives

Posted by Loren Wilton <lw...@earthlink.net>.
It doesn't appear anyone else has replied so...


> Sorry for the stupid question, but referring to the SpamAssassin web-site I
> could not get an answer to the following question:
> How do I safely remove my existing bayes database?

Just have to remove the files.  There are usually three as best I recall,
including the journal file.  Obviously best to stop SA while doing this.
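
As a rough sketch of that removal step: with the default DBM storage the files normally share the prefix bayes_ (typically bayes_toks, bayes_seen and bayes_journal) and usually sit in the .spamassassin directory of the user SA runs as. The directory used below and the helper names find_bayes_files and remove_bayes_files are assumptions for illustration only; check bayes_path in your own configuration before deleting anything.

```python
from pathlib import Path


def find_bayes_files(directory):
    """Return the files whose names start with 'bayes_' in the given directory."""
    return sorted(p for p in Path(directory).glob("bayes_*") if p.is_file())


def remove_bayes_files(directory, dry_run=True):
    """List the Bayes files; delete them only when dry_run is False."""
    files = find_bayes_files(directory)
    for path in files:
        if not dry_run:
            path.unlink()
    return [p.name for p in files]


if __name__ == "__main__":
    # Assumed default location. This call only lists the files (dry run).
    # Stop spamd first if you rerun it with dry_run=False.
    print(remove_bayes_files(Path.home() / ".spamassassin"))
```

Running it in dry-run mode first shows exactly which files would go, which is a cheap safeguard on a site-wide install.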


> > The next thing that could help you is to enable net tests, specifically the
> > SURBL checks.  These will catch a lot of your spams.
>
> Isn't it enabled by default? I start spamd without any options.

It isn't enabled if net checks aren't enabled, and I don't recall if net
checks are enabled by default.  I wouldn't think so, but maybe.

Also, even if net checks are enabled, they won't run if you don't have a new
enough version of Net::DNS.  Running 'spamassassin -D' will tell you whether
the net checks are enabled and whether Net::DNS is a new enough version.

I suspect they are not running for you, because if they were they would
probably have been catching most of your spam.


Also, SURBL has added another test, JP I believe, since the 3.0.2 release.  It
requires manually adding a line or two to enable the test.  The docs are, I
believe, on the SURBL site.
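
For readers unfamiliar with how a SURBL-style check works, here is a minimal Python sketch of the idea: the domain from a URL in the message is prepended to the list's DNS zone, and the last octet of the A record that comes back is a bitmask saying which sub-lists contain the domain. The bit value 64 for the JP list is an assumption from memory, and example.com is only a placeholder; verify the zone and the bit against the SURBL documentation. The helper names are hypothetical and the sketch performs no DNS query itself.

```python
import ipaddress

SURBL_ZONE = "multi.surbl.org"
JP_BIT = 64  # assumed bit for the JP list; verify against the SURBL docs


def surbl_query_name(domain, zone=SURBL_ZONE):
    """Build the DNS name that is looked up for a domain found in a message URL."""
    return f"{domain.strip('.').lower()}.{zone}"


def listed_on(answer, bit):
    """Return True if the A record answer (e.g. '127.0.0.66') has the list bit set."""
    last_octet = int(ipaddress.IPv4Address(answer)) & 0xFF
    return bool(last_octet & bit)


if __name__ == "__main__":
    print(surbl_query_name("Example.COM"))
    print(listed_on("127.0.0.66", JP_BIT))
```

A message with no URL in its body, like a bare text teaser, gives this kind of check nothing to look up.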

        Loren


Re: false positives and negatives

Posted by Chavdar Videff <ch...@mr-bricolage.bg>.
On Tuesday 31 May 2005 10:13, Loren Wilton wrote:
> The spam you show is difficult to handle.  One important thing is there is
> no url or other link in the message body to a drug site where people could
> get the spammed product.  I am assuming the original spam must have had
> such, since a spam without a link is fairly useless.  If you are getting

I just quoted the beginning of the message...

>
> Next remove your existing bayes database and start over.  You will need to
> manually train it on at least 200 each ham and spam.  If you make a couple
> of mbox files, one with manually sorted spam, the other manually sorted
> ham, and feed these to sa-learn correctly, you should be able to get bayes
> working for you in no more than a day or two, probably only a few hours,
> depending on your mail rate.
>

Sorry for the stupid question, but referring to the SpamAssassin web-site I 
could not get an answer to the following question: 
How do I safely remove my existing bayes database?

>
> That should get bayes on your side pretty quickly.
>
> The next thing that could help you is to enable net tests, specifically the
> SURBL checks.  These will catch a lot of your spams.
>

Isn't it enabled by default? I start spamd without any options.

>         Loren

Re: false positives and negatives

Posted by jdow <jd...@earthlink.net>.
You have several options. I run about 40 of them. Most of them can be found
at http://www.rulesemporium.com/ and amount to human-generated Bayes databases
that work on phrases rather than single words.

{^_-}
----- Original Message ----- 
From: "Chavdar Videff" <ch...@mr-bricolage.bg>


On Tuesday 31 May 2005 05:16, Loren Wilton wrote:
> > 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0
> > points and gets through (even if we set required hits to 3 and 2 for certain
> > mailboxes).
>
> I assume you mean here that you have 1000 spam a week leaking through?  Or
> do you mean that you have 1000 spam a week TOTAL and ALL of it gets
> through?
>
> Setting the required hits below 5, or certainly below 4, is not the answer.
> You have something else wrong, I would say severely wrong.  You have Bayes
> turned on, and it should be taking care of the vast majority of this sort
> of thing, if it is properly trained.
>
> If Bayes is improperly trained, it could be causing your problem, by
> claiming that some class (or possibly all) of your spam is really ham, and
> lowering the score.
>
> That is about the limit of the help we can give from what you posted.  If
> you posted a typical spam *with complete headers* and *with the scores you
> got* we would be able to look at it and probably spot some obvious
> problems. As it is, all we can do is guess.
>
>         Loren

Sorry for my late reply - my evening is your morning.
There is 1000 spam a week that leaks through and perhaps another 500-600 that
get filtered by spamassassin.
If my Bayes is poorly trained what options do I have.
Here is a typical letter that gets through.

===================================================================================
Return-Path: <le...@street67.net>
 Received: from fw.doverie.bg (doh-gw.customer.0rbitel.net [195.24.44.114])
        by mail1.mr-bricolage.bg (8.13.3/8.13.3/Debian-6) with SMTP id j4V11DGj014435
        for <ch...@mr-bricolage.bg>; Tue, 31 May 2005 04:01:15 +0300
 Received: (qmail 13680 invoked by uid 507); 31 May 2005 00:58:54 -0000
 Delivered-To: doverie.bg-andrei@doverie.bg
 Received: (qmail 13672 invoked by uid 503); 31 May 2005 00:58:48 -0000
 Received: from lennon@street67.net by fw.doverie.bg by uid 500 with qmail-scanner-1.15
  (f-prot: 3.12.  Clear:.
  Processed in 12.821956 secs); 31 May 2005 00:58:48 -0000
 Received: from cow100.orbitel.bg (HELO ns.orbitel.bg) (195.24.32.18)
  by 0 with SMTP; 31 May 2005 00:58:20 -0000
 Received: (qmail 607 invoked from network); 31 May 2005 01:01:36 -0000
 Received: from unknown (HELO street67.net) (219.134.152.97)
  by ns.orbitel.bg with SMTP; 31 May 2005 01:01:36 -0000
 Message-ID: <00...@street67.net>
 Date: Mon, 30 May 2005 16:15:11 +1100
 From: "michael torrey" <le...@street67.net>
 User-Agent: QUALCOMM Windows Eudora Version 6.0.0.22
 X-Accept-Language: en-us
 MIME-Version: 1.0
 To: "Elden Irving" <an...@doverie.bg>
 Cc: <ji...@doverie.bg>,
  <do...@doverie.bg>
 Subject: It is all about quality tableets sold at the finest prices.
 Content-Type: text/plain;
  charset="us-ascii"
 Content-Transfer-Encoding: 7bit
 X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on
        mail1.mr-bricolage.bg
 X-Spam-Level:
 X-Spam-Status: No, score=0.1 required=2.0 tests=FORGED_RCVD_HELO
        autolearn=ham version=3.0.2
 Status: R
 X-Status: N
 X-KMail-EncryptionState:
 X-KMail-SignatureState:
 X-KMail-MDN-Sent:

At our rxdrug-site, you can choose top-selling rxmeds at a reduced prices.
Legitimate way to e-shoppe for tableets. We provide customers flexible and
reliable distribution services.
======================================================================

Regards

Chavdar Videff



Re: false positives and negatives

Posted by Loren Wilton <lw...@earthlink.net>.
> Sorry for my late reply - my evening is your morning.
> There is 1000 spam a week that leaks through and perhaps another 500-600 that
> get filtered by spamassassin.
> If my Bayes is poorly trained what options do I have.
> Here is a typical letter that gets through.
>
> ===================================================================================
> Return-Path: <le...@street67.net>
>  Received: from fw.doverie.bg (doh-gw.customer.0rbitel.net [195.24.44.114])
>         by mail1.mr-bricolage.bg (8.13.3/8.13.3/Debian-6) with SMTP id j4V11DGj014435
>         for <ch...@mr-bricolage.bg>; Tue, 31 May 2005 04:01:15 +0300
>  Received: (qmail 13680 invoked by uid 507); 31 May 2005 00:58:54 -0000
>  Delivered-To: doverie.bg-andrei@doverie.bg
>  Received: (qmail 13672 invoked by uid 503); 31 May 2005 00:58:48 -0000
>  Received: from lennon@street67.net by fw.doverie.bg by uid 500 with qmail-scanner-1.15
>   (f-prot: 3.12.  Clear:.
>   Processed in 12.821956 secs); 31 May 2005 00:58:48 -0000
>  Received: from cow100.orbitel.bg (HELO ns.orbitel.bg) (195.24.32.18)
>   by 0 with SMTP; 31 May 2005 00:58:20 -0000
>  Received: (qmail 607 invoked from network); 31 May 2005 01:01:36 -0000
>  Received: from unknown (HELO street67.net) (219.134.152.97)
>   by ns.orbitel.bg with SMTP; 31 May 2005 01:01:36 -0000
>  Message-ID: <00...@street67.net>
>  Date: Mon, 30 May 2005 16:15:11 +1100
>  From: "michael torrey" <le...@street67.net>
>  User-Agent: QUALCOMM Windows Eudora Version 6.0.0.22
>  X-Accept-Language: en-us
>  MIME-Version: 1.0
>  To: "Elden Irving" <an...@doverie.bg>
>  Cc: <ji...@doverie.bg>,
>   <do...@doverie.bg>
>  Subject: It is all about quality tableets sold at the finest prices.
>  Content-Type: text/plain;
>   charset="us-ascii"
>  Content-Transfer-Encoding: 7bit
>  X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on
>         mail1.mr-bricolage.bg
>  X-Spam-Level:
>  X-Spam-Status: No, score=0.1 required=2.0 tests=FORGED_RCVD_HELO
>         autolearn=ham version=3.0.2
>  Status: R
>  X-Status: N
>  X-KMail-EncryptionState:
>  X-KMail-SignatureState:
>  X-KMail-MDN-Sent:
>
> At our rxdrug-site, you can choose top-selling rxmeds at a reduced prices.
> Legitimate way to e-shoppe for tableets. We provide customers flexible and
> reliable distribution services.
> ======================================================================

It is a holiday in the US, so you probably won't receive more replies for some
hours.

The spam you show is difficult to handle.  One important thing is that there is
no url or other link in the message body to a drug site where people could
get the spammed product.  I am assuming the original spam must have had
one, since a spam without a link is fairly useless.  If you are getting
spams without links similar to this, then other methods, such as writing
some custom rules, would be required to eliminate the problem.

Bayes did not trigger on this message, either for or against.  I'm somewhat
surprised that Bayes didn't even show a BAYES_50 score though.  So bayes is
neither helping nor hindering.  It should be helping.  But that gets us to
the next point:

> autolearn=ham

Bayes autolearn is enabled, as it is by default.  Since this got a low
score, it has been learned as ham rather than spam.  Sooner or later Bayes
will start helping messages like this get through by giving them scores of
BAYES_00.
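
To make the reading of that header concrete, here is a small Python sketch that pulls the fields out of an X-Spam-Status value in the format shown in the quoted message. The function name parse_spam_status is hypothetical, and the parsing assumes the 3.0.x layout seen in this thread (verdict, then key=value pairs); other versions or custom header templates may differ.

```python
import re


def parse_spam_status(header_value):
    """Split an X-Spam-Status value into verdict, score, threshold, tests, autolearn."""
    # Undo header folding, then drop whitespace after commas so a folded
    # tests= list stays in one piece.
    text = re.sub(r",\s+", ",", " ".join(header_value.split()))
    verdict = text.split(",", 1)[0].strip()
    fields = dict(re.findall(r"(\w+)=(\S*)", text))
    tests = [t for t in fields.get("tests", "").split(",") if t and t != "none"]
    return {
        "is_spam": verdict.lower() == "yes",
        "score": float(fields["score"]),
        "required": float(fields["required"]),
        "tests": tests,
        "autolearn": fields.get("autolearn"),
    }


if __name__ == "__main__":
    sample = ("No, score=0.1 required=2.0 tests=FORGED_RCVD_HELO \n"
              "        autolearn=ham version=3.0.2")
    print(parse_spam_status(sample))
```

For the header quoted above this gives a score of 0.1 against a threshold of 2.0, a single rule hit (FORGED_RCVD_HELO), no BAYES_* rule at all, and autolearn set to ham, which are exactly the points made in the reply.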

You could back this particular message out of Bayes by learning it manually
as spam.  However, if you are having 1000 messages a week leak through with
low scores, your Bayes database probably believes that all spams are hams at
this point.  So there is no point in learning individual messages correctly
just yet; your bayes database is probably junk.

Start by setting bayes_auto_learn to 0 in local.cf to disable auto
learning - it is doing much more harm than good at this point.  Later you
will probably be able to turn it back on, once you have a Bayes database
that knows spam from ham.  But not yet.

Also add a score line for BAYES_99 to fix the poor scoring in 3.0.2 for this
rule:
    score BAYES_99    4
should do the trick.

Next remove your existing bayes database and start over.  You will need to
manually train it on at least 200 each ham and spam.  If you make a couple
of mbox files, one with manually sorted spam, the other manually sorted ham,
and feed these to sa-learn correctly, you should be able to get bayes
working for you in no more than a day or two, probably only a few hours,
depending on your mail rate.
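
The retraining step could be scripted roughly as follows. This Python sketch counts the messages in two hand-sorted mbox files, refuses to go on if either holds fewer than the 200 mentioned above, and returns the sa-learn command lines to run. The file names in the usage example and the helper names are assumptions for illustration; the sketch only builds the commands and does not execute them.

```python
import mailbox

MIN_MESSAGES = 200  # the "at least 200 each" figure from the advice above


def count_messages(mbox_path):
    """Count the messages in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def training_commands(spam_mbox, ham_mbox, minimum=MIN_MESSAGES):
    """Return the sa-learn command lines, or raise if a corpus is too small."""
    for label, path in (("spam", spam_mbox), ("ham", ham_mbox)):
        found = count_messages(path)
        if found < minimum:
            raise ValueError(
                f"{label} mbox has {found} messages, need at least {minimum}"
            )
    return [
        ["sa-learn", "--spam", "--mbox", spam_mbox],
        ["sa-learn", "--ham", "--mbox", ham_mbox],
    ]
```

A call such as training_commands("sorted-spam.mbox", "sorted-ham.mbox") would give two argument lists that can be handed to subprocess.run once you are happy with the counts.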

Keep training bayes manually every now and then.  You should get a good base
of at least a few thousand hams and spams each, representative of the sort
of mail you get.  If you start seeing new spams that are scoring below
BAYES_70 or so, learn a few of them.  Every so often learn a few new hams to
keep things balanced.  You typically will only have to spend a few minutes
a week dealing with this.  If you get bayes trained well, you could turn on
auto-learning again.  But I'm personally nervous doing this, and it isn't
that hard to toss a few messages to bayes every now and then.
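
One way to spot the spams worth feeding back, sketched in Python under the assumption that you already have the list of rule names a known spam hit: look for its BAYES_nn rule and flag the message when that rule is missing or its number is below the "BAYES_70 or so" mark. The function names and the default cutoff of 70 are illustrative choices, not anything SA provides.

```python
import re


def bayes_bucket(tests):
    """Return the numeric suffix of the BAYES_nn rule in a list of rule names, or None."""
    for name in tests:
        match = re.fullmatch(r"BAYES_(\d+)", name)
        if match:
            return int(match.group(1))
    return None


def worth_learning_as_spam(tests, cutoff=70):
    """True for a known spam whose Bayes rule is missing or below the cutoff."""
    bucket = bayes_bucket(tests)
    return bucket is None or bucket < cutoff
```

This is only meant for messages you have already confirmed as spam by eye; it says nothing about whether an unknown message is spam.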

That should get bayes on your side pretty quickly.

The next thing that could help you is to enable net tests, specifically the
SURBL checks.  These will catch a lot of your spams.

You might need to be careful with any other net checks.  You have a really
screwy sequence of received headers, with all of those qmail headers between
all the real headers.  I don't know if SA will be able to deal with that and
figure out where your main mail gateway is so that it can determine the
trusted hosts correctly.

        Loren


Re: false positives and negatives

Posted by Chavdar Videff <ch...@mr-bricolage.bg>.
On Tuesday 31 May 2005 05:16, Loren Wilton wrote:
> > 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0
> > points and gets through (even if we set required hits to 3 and 2 for certain
> > mailboxes).
>
> I assume you mean here that you have 1000 spam a week leaking through?  Or
> do you mean that you have 1000 spam a week TOTAL and ALL of it gets
> through?
>
> Setting the required hits below 5, or certainly below 4, is not the answer.
> You have something else wrong, I would say severely wrong.  You have Bayes
> turned on, and it should be taking care of the vast majority of this sort
> of thing, if it is properly trained.
>
> If Bayes is improperly trained, it could be causing your problem, by
> claiming that some class (or possibly all) of your spam is really ham, and
> lowering the score.
>
> That is about the limit of the help we can give from what you posted.  If
> you posted a typical spam *with complete headers* and *with the scores you
> got* we would be able to look at it and probably spot some obvious
> problems. As it is, all we can do is guess.
>
>         Loren

Sorry for my late reply - my evening is your morning.
There is 1000 spam a week that leaks through and perhaps another 500-600 that 
get filtered by spamassassin.
If my Bayes is poorly trained, what options do I have?
Here is a typical letter that gets through.

===================================================================================
Return-Path: <le...@street67.net>
 Received: from fw.doverie.bg (doh-gw.customer.0rbitel.net [195.24.44.114])
        by mail1.mr-bricolage.bg (8.13.3/8.13.3/Debian-6) with SMTP id j4V11DGj014435
        for <ch...@mr-bricolage.bg>; Tue, 31 May 2005 04:01:15 +0300
 Received: (qmail 13680 invoked by uid 507); 31 May 2005 00:58:54 -0000
 Delivered-To: doverie.bg-andrei@doverie.bg
 Received: (qmail 13672 invoked by uid 503); 31 May 2005 00:58:48 -0000
 Received: from lennon@street67.net by fw.doverie.bg by uid 500 with qmail-scanner-1.15
  (f-prot: 3.12.  Clear:.
  Processed in 12.821956 secs); 31 May 2005 00:58:48 -0000
 Received: from cow100.orbitel.bg (HELO ns.orbitel.bg) (195.24.32.18)
  by 0 with SMTP; 31 May 2005 00:58:20 -0000
 Received: (qmail 607 invoked from network); 31 May 2005 01:01:36 -0000
 Received: from unknown (HELO street67.net) (219.134.152.97)
  by ns.orbitel.bg with SMTP; 31 May 2005 01:01:36 -0000
 Message-ID: <00...@street67.net>
 Date: Mon, 30 May 2005 16:15:11 +1100
 From: "michael torrey" <le...@street67.net>
 User-Agent: QUALCOMM Windows Eudora Version 6.0.0.22
 X-Accept-Language: en-us
 MIME-Version: 1.0
 To: "Elden Irving" <an...@doverie.bg>
 Cc: <ji...@doverie.bg>,
 <do...@doverie.bg>
 Subject: It is all about quality tableets sold at the finest prices.
 Content-Type: text/plain;
  charset="us-ascii"
 Content-Transfer-Encoding: 7bit
 X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on 
        mail1.mr-bricolage.bg
 X-Spam-Level: 
 X-Spam-Status: No, score=0.1 required=2.0 tests=FORGED_RCVD_HELO 
        autolearn=ham version=3.0.2
 Status: R
 X-Status: N
 X-KMail-EncryptionState: 
 X-KMail-SignatureState: 
 X-KMail-MDN-Sent: 
 
At our rxdrug-site, you can choose top-selling rxmeds at a reduced prices.
Legitimate way to e-shoppe for tableets. We provide customers flexible and
reliable distribution services.
======================================================================

Regards

Chavdar Videff

Re: false positives and negatives

Posted by Loren Wilton <lw...@earthlink.net>.
> 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0
> points and gets through (even if we set required hits to 3 and 2 for certain
> mailboxes).

I assume you mean here that you have 1000 spam a week leaking through?  Or
do you mean that you have 1000 spam a week TOTAL and ALL of it gets through?

Setting the required hits below 5, or certainly below 4, is not the answer.
You have something else wrong, I would say severely wrong.  You have Bayes
turned on, and it should be taking care of the vast majority of this sort of
thing, if it is properly trained.

If Bayes is improperly trained, it could be causing your problem, by
claiming that some class (or possibly all) of your spam is really ham, and
lowering the score.

That is about the limit of the help we can give from what you posted.  If
you posted a typical spam *with complete headers* and *with the scores you
got* we would be able to look at it and probably spot some obvious problems.
As it is, all we can do is guess.

        Loren


Re: false positives and negatives

Posted by Craig Jackson <cj...@localsurface.com>.
Chavdar Videff wrote:
> Dear List,
> 
> I know these are subject of the FAQ and the documentation, yet after I read 
> all of it I didn't get an answer to the following questions:
> 
> 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0 
> points and gets through (even if we set required hits to 3 and 2 for certain 
> mailboxes).
> 
> 2. Mail composed as HTML is rated as spam for the above reason.
> 
> What can we do to improve the situation and boost the performance of SA.
> 
> I assume that if we set required hits below 5.0, ham messages composed as HTML 
> will be rated as spam. However, the overwhelming number of spam rated below 
> 4, 3, 2 and even 1 points that we receive renders spamassassin useless for 
> our mail-server.
> 
> We sort ham and spam and run sa-learn daily in order to train SA, we feed the 
> low-rated spam and ham that is not rated correctly to sa-learn without any 
> success: most messages (that are repeated) continue to go through.
> 
> Please help.
> 
> Why doesn't sa-learn help. We thought that if we submit to sa-learn a messages 
> that was mistaken, the next time a message that is the same or from the same 
> address will be sorted correctly.
> 

We had the same problem. We did a quick study of the spam and determined
that most of it is from hardcore spammers and the rest is from
"spammers" that users signed up for (or nearly signed up for). Greylisting
knocks out 95% of the hardcore spam. The signed-up spam is all HTML and can
be identified pretty easily, with few false positives, by increasing the
HTML test scores a bit, adding a few tests related to disclaimers,
unsubscribing and best deals, and adding some SA whitelist entries. Now 99%
of our spam is gone and it requires very little work. This is after I
disabled AWL and Bayes in SA. We use Spamcop (in Exim) but I disable all of
the DNS tests in SA. I think the SA DNS tests are actually very good and I
may try them. Until now I have been concerned about network traffic.

Good luck,
Craig Jackson

Re: false positives and negatives

Posted by JamesDR <ro...@bellsouth.net>.
Chavdar Videff wrote:
> Dear List,
> 
> I know these are subject of the FAQ and the documentation, yet after I read 
> all of it I didn't get an answer to the following questions:
> 
> 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0 
> points and gets through (even if we set required hits to 3 and 2 for certain 
> mailboxes).
> 
> 2. Mail composed as HTML is rated as spam for the above reason.
> 
> What can we do to improve the situation and boost the performance of SA.
> 
> I assume that if we set required hits below 5.0, ham messages composed as HTML 
> will be rated as spam. However, the overwhelming number of spam rated below 
> 4, 3, 2 and even 1 points that we receive renders spamassassin useless for 
> our mail-server.
> 
> We sort ham and spam and run sa-learn daily in order to train SA, we feed the 
> low-rated spam and ham that is not rated correctly to sa-learn without any 
> success: most messages (that are repeated) continue to go through.
> 
> Please help.
> 
> Why doesn't sa-learn help. We thought that if we submit to sa-learn a messages 
> that was mistaken, the next time a message that is the same or from the same 
> address will be sorted correctly.
> 
> 
> Following is the configuration file (debian sid, sendmail, sitewide 
> configuration of SA).
> 
> mail1:/home/chavdar# cat /etc/mail/spamassassin/local.cf
> # This is the right place to customize your installation of SpamAssassin.
> #
> # See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
> # tweaked.
> #
> ###########################################################################
> #
> # rewrite_header Subject *****SPAM*****
> # report_safe 1
> # trusted_networks 10.50
> # lock_method flock
> 
> required_hits 3
> rewrite_subject 1
> report_header 1
> use_terse_report 1
> defang_mime 0
> report_safe 0
> use_bayes 1
> auto_learn 1
> 
> Regards
> 
> Chavdar Videff
> 
Your bayes may be hosed. You may want to tune the score thresholds at which
autolearn learns. It also seems you are using an older version of
SA. Upgrade to take advantage of URI BLs. Without the headers from the
mails that were false positives or false negatives, it is hard to guess what
the real situation is.
HTH


-- 
Thanks,
JamesDR

Re: false positives and negatives

Posted by jdow <jd...@earthlink.net>.
From: "Chavdar Videff" <ch...@mr-bricolage.bg>

> Dear List,
>
> I know these are subject of the FAQ and the documentation, yet after I read
> all of it I didn't get an answer to the following questions:
>
> 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0
> points and gets through (even if we set required hits to 3 and 2 for certain
> mailboxes).
>
> 2. Mail composed as HTML is rated as spam for the above reason.
>
> What can we do to improve the situation and boost the performance of SA.
>
> I assume that if we set required hits below 5.0, ham messages composed as HTML
> will be rated as spam. However, the overwhelming number of spam rated below
> 4, 3, 2 and even 1 points that we receive renders spamassassin useless for
> our mail-server.
>
> We sort ham and spam and run sa-learn daily in order to train SA, we feed the
> low-rated spam and ham that is not rated correctly to sa-learn without any
> success: most messages (that are repeated) continue to go through.
>
> Please help.
>
> Why doesn't sa-learn help. We thought that if we submit to sa-learn a messages
> that was mistaken, the next time a message that is the same or from the same
> address will be sorted correctly.
>
>
> Following is the configuration file (debian sid, sendmail, sitewide
> configuration of SA).
>
> mail1:/home/chavdar# cat /etc/mail/spamassassin/local.cf
> # This is the right place to customize your installation of SpamAssassin.
> #
> # See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
> # tweaked.
> #
> ###########################################################################
> #
> # rewrite_header Subject *****SPAM*****
> # report_safe 1
> # trusted_networks 10.50
> # lock_method flock
>
> required_hits 3
> rewrite_subject 1
> report_header 1
> use_terse_report 1
> defang_mime 0
> report_safe 0
> use_bayes 1
> auto_learn 1
  ^^^^^^^^^^

IMAO that is an utterly darned fool thing to use when coming up from a
cold SpamAssassin start. I've found that a raw SpamAssassin install is
wretched at filtering spam. Using autolearn at that time leads to the
Bayes filter being very poorly trained. There, I got that off my
chest. TPTB designing SpamAssassin disagree with me, obviously.
My opinion comes from watching this list for a couple of years or so now.
Either auto_learn needs to default to off or the spam/ham autolearn
thresholds need to be dramatically changed. (I also note that a fluke
in the scores configuration resulted in BAYES_99 having an absurdly
low score in spite of its being a nearly perfect spam sign on a well
trained database. I take it as an indication that they used a poorly
trained "autolearned" database when they set up their entire scoring set.)

My personal suggestions follow.

1) Nuke your current Bayes.
2) Install carefully selected SARE rule sets. Review and update your
   SARE rules regularly. Update weekly or more often. Review your rule
   sets against offerings at least once a month.
3) Turn off auto_learn or move the ham and spam thresholds for autolearn
   farther away from your spam threshold. (I just fail to see the logic.
   I consider it to be a tool for killing Bayes.)
4) Grit your teeth and use SURBL. (I don't like many black list policies.
   SURBL is quite honorable about theirs.)
5) Manually train on ham and spam per user with per user Bayes. (Shared
   Bayes is often less than useless. One person's desired porno mail is
   another person's extreme spam.)
6) NEVER delete spam. Forward it to the user marked as spam so that it
   can be eliminated after their review. An ISP should include in the
   default email setup a spam folder with a rule that places the spam
   into that folder. Explain to users why you did this and how to change
   the folder destination into a simple delete.
7) If you later reenable auto learn do so with extended thresholds.
8) If you manually train do it rigorously for the first few weeks then
   only train Bayes on spam if you happen to notice a low scoring spam
   that is not BAYES_99 and includes more than one or two lines of text.
9) Save all the ham and spam you used for training in case you have to
   rebuild the Bayes in the future. It saves time.
10) If you're in a multi-user environment make it as easy as possible for
   them to move an email from the incoming area to the spam or ham folders
   you provide for the user. Then have a script that performs automatic
   training at sane intervals. As part of training divert the spam, at
   least, to a spam database that can be used to retrain Bayes as quickly
   as possible in case of a glitch.
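
To illustrate points 3 and 7, here is a deliberately simplified Python model of the autolearn decision. It assumes the stock thresholds are 0.1 for ham and 12.0 for spam (the defaults of bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam as I recall them from the 3.0 documentation). The real SpamAssassin logic computes a separate autolearn score and applies further conditions, so treat this as a sketch of the idea only; the widened values -1.0 and 15.0 are made-up examples.

```python
def autolearn_decision(score, ham_below=0.1, spam_above=12.0):
    """Simplified model: learn as ham below one threshold, as spam above the other."""
    if score < ham_below:
        return "ham"
    if score > spam_above:
        return "spam"
    return "no"


if __name__ == "__main__":
    # A spam that slips through with a near-zero score is learned as ham
    # under the assumed defaults...
    print(autolearn_decision(0.05))
    # ...but is left alone once the ham threshold is moved further away.
    print(autolearn_decision(0.05, ham_below=-1.0, spam_above=15.0))
```

The wider the gap between the two thresholds, the fewer borderline messages get learned automatically, which is the whole point of the suggestion.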

Thus is the road to very low false negatives and false positives. I tend
to get 700 to 1300 emails per day. Of that about 1/3 is spam. (Hey, the
Linux Kernel Mailing List and the Mandriva lists tend to be busy, as does
this one. It adds up to a lot of ham in a hurry for me. {^_-}) My FP and
FN levels are on the order of one per thousand. FNs occur chiefly when a new
spam address appears and the spammer uses new techniques to hide the
spamminess. Pending SURBL catches, I build a quick rule for it. My FP rate
lives between 0 and 10 per thousand, almost all from either patches or bug
reports on LKML or the occasional AOL email that comes through a new email
relay not yet covered by my test for legitimate AOL mail, which I use to
catch bogus spam containing AOL addresses.

What I do about the FPs is simple. I sort all the spam into a spam folder.
Then I sort by subject. Since the subject markup gives a three digit
score it is easy for me to look at the first dozen or so entries to see
if any of the low scoring spam was ham. Above about "12" here is never
never land - I never see ham up there. (Or if it's ham I don't want to
see it. {^_-})




Re: false positives and negatives

Posted by JamesDR <ro...@bellsouth.net>.
Chavdar Videff wrote:
> Dear List,
> 
> I know these are subject of the FAQ and the documentation, yet after I read 
> all of it I didn't get an answer to the following questions:
> 
> 1. At our site we get approx. 1000 spam a week. Most of it is rated below 2.0 
> points and gets through (even if we set required hits to 3 and 2 for certain 
> mailboxes).
> 
> 2. Mail composed as HTML is rated as spam for the above reason.
> 
> What can we do to improve the situation and boost the performance of SA.
> 
> I assume that if we set required hits below 5.0, ham messages composed as HTML 
> will be rated as spam. However, the overwhelming number of spam rated below 
> 4, 3, 2 and even 1 points that we receive renders spamassassin useless for 
> our mail-server.
> 
> We sort ham and spam and run sa-learn daily in order to train SA, we feed the 
> low-rated spam and ham that is not rated correctly to sa-learn without any 
> success: most messages (that are repeated) continue to go through.
> 
> Please help.
> 
> Why doesn't sa-learn help. We thought that if we submit to sa-learn a messages 
> that was mistaken, the next time a message that is the same or from the same 
> address will be sorted correctly.
> 
> 
> Following is the configuration file (debian sid, sendmail, sitewide 
> configuration of SA).
> 
> mail1:/home/chavdar# cat /etc/mail/spamassassin/local.cf
> # This is the right place to customize your installation of SpamAssassin.
> #
> # See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
> # tweaked.
> #
> ###########################################################################
> #
> # rewrite_header Subject *****SPAM*****
> # report_safe 1
> # trusted_networks 10.50
> # lock_method flock
> 
> required_hits 3
> rewrite_subject 1
> report_header 1
> use_terse_report 1
> defang_mime 0
> report_safe 0
> use_bayes 1
> auto_learn 1
> 
> Regards
> 
> Chavdar Videff
> 
Taking a second look at what you posted, it also looks like your config 
is incorrect for version 3.0.x
<snip>
# rewrite_header Subject *****SPAM*****
# report_safe 1
# trusted_networks 10.50
# lock_method flock
</snip>
iirc is 3.0.x syntax which would make:
<snip>
rewrite_subject 1
report_header 1
use_terse_report 1
defang_mime 0
auto_learn 1
</snip>
invalid.

Check your docs for the correct syntax for those config directives.
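
A quick way to find such lines in a longer local.cf, sketched in Python. The table below contains only the directives flagged in this message, and it offers a replacement only where this thread itself names one (rewrite_header from the commented template, bayes_auto_learn from an earlier reply). Everything else is marked as "look it up", and whether a given directive is really rejected by 3.0.x is for the documentation or 'spamassassin --lint' to confirm. The function name is hypothetical.

```python
# Directives flagged in this thread as old-style. A value of None means
# no replacement was named here, so check the documentation.
OLD_DIRECTIVES = {
    "rewrite_subject": "rewrite_header Subject *****SPAM*****",
    "auto_learn": "bayes_auto_learn",
    "report_header": None,
    "use_terse_report": None,
    "defang_mime": None,
}


def find_old_directives(config_text):
    """Return (line_number, directive, suggestion) for each flagged directive."""
    findings = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        name = stripped.split()[0]
        if name in OLD_DIRECTIVES:
            findings.append((number, name, OLD_DIRECTIVES[name]))
    return findings
```

Fed the active lines of the posted local.cf, it reports rewrite_subject, report_header, use_terse_report, defang_mime and auto_learn, and leaves required_hits, report_safe and use_bayes alone.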

-- 
Thanks,
JamesDR