Posted to users@spamassassin.apache.org by Greg Troxel <gd...@ir.bbn.com> on 2008/05/30 22:18:14 UTC

double rules hits?

I have been seeing several occasions where two rules hit for the same
underlying issue, and it seems that this isn't really desired.

Example 1: I got ham that had a line with

  dig [some.isp.name.].isphosts.junkemailfilter.com

in it.  It seems giving it 2.3 points for SPOOF_COM2COM is fair, but
that turns out to be 4.3 because SPOOF_COM2OTH gets 2.0.  This ended up
as a FP because I filter to spam folder at 1, preferring to misclassify
some list mail to keep my inbox as clean as I can.


X-Spam-Status: Yes, score=1.7 required=1.0 tests=AWL,BAYES_00,HTML_MESSAGE,
	SPOOF_COM2COM,SPOOF_COM2OTH autolearn=no version=3.2.4
X-Spam-Report: 
	*  2.0 SPOOF_COM2OTH URI: URI contains ".com" in middle
	*  2.3 SPOOF_COM2COM URI: URI contains ".com" in middle and end
	* -2.6 BAYES_00 BODY: Bayesian spam probability is 0 to 1%
	*      [score: 0.0000]
	*  0.0 HTML_MESSAGE BODY: HTML included in message
	*  0.0 AWL AWL: From: address is in the auto white-list

Example 2: blacklists

Here, the mail is spam from a bad source, but with two lists making more
or less the same claim, it doesn't seem quite right to add the scores.  In
this case spamcop says the machine has sent spam, and spamhaus that it's
in XBL for being a compromised box.

X-Spam-Status: Yes, score=3.6 required=1.0 tests=AWL,BAYES_50,HTML_MESSAGE,
        RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_XBL,RDNS_NONE autolearn=spam version=3.2.4
X-Spam-Report: 
        *  0.0 HTML_MESSAGE BODY: HTML included in message
        *  0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60%
        *      [score: 0.5676]
        *  4.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
        *      [Blocked - see <http://www.spamcop.net/bl.shtml?123.142.103.19>]
        *  3.0 RCVD_IN_XBL RBL: Received via a relay in Spamhaus XBL
        *      [123.142.103.19 listed in zen.spamhaus.org]
        *  0.1 RDNS_NONE Delivered to trusted network by a host with no rDNS
        * -3.6 AWL AWL: From: address is in the auto white-list



So, I realize this would be complicated, but I wonder about having a
score combining function for tests that are making essentially the same
claim.  Perhaps the 4 and 3 above should combine to 5, and the
SPOOF_COM2* should just be 2.3.

Re: double rules hits?

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2008-05-30 at 16:18 -0400, Greg Troxel wrote:
> I have been seeing several occasions where two rules hit for the same
> underlying issue, and it seems that this isn't really desired.
> 
> Example 1: I got ham that had a line with
> 
>   dig [some.isp.name.].isphosts.junkemailfilter.com
> 
> in it.  It seems giving it 2.3 points for SPOOF_COM2COM is fair, but
> that turns out to be 4.3 because SPOOF_COM2OTH gets 2.0.  This ended up
> as a FP because I filter to spam folder at 1,  [...]

That is *really* drastic. Much too low, IMHO.

> preferring to misclassify some list mail  [...]

Do not filter the SA list. We are talking about spam. You will get FPs.

> to keep my inbox as clean as I can.

Hmm, why do mailing lists end up in your Inbox anyway, rather than being
filtered / moved into dedicated mail folders? In most cases, doing so
without processing these messages by SA is a sensible decision...

> X-Spam-Status: Yes, score=1.7 required=1.0 tests=AWL,BAYES_00,HTML_MESSAGE,
> 	SPOOF_COM2COM,SPOOF_COM2OTH autolearn=no version=3.2.4

This is only an FP because *you* deliberately chose it to be. A score
of 1.7 can hardly be a reason for complaining about FPs.


> Example 2: blacklists
[...]
> So, I realize this would be complicated, but I wonder about having a
> score combining function for tests that are making essentially the same
> claim.  Perhaps the 4 and 3 above should combine to 5, and the
> SPOOF_COM2* should just be 2.3.

It isn't complicated. You can easily set up meta rules that
"correct" (reduce, in your case) the score.

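A minimal sketch of such meta rules for local.cf, assuming the scores shown
in the two reports above (2.3 / 2.0 and 4.0 / 3.0). The LOCAL_* rule names
and the -2.0 offsets are made up for illustration; they are not stock rules,
and the offsets need adjusting if your scores differ:

```
# local.cf sketch: LOCAL_* names and the -2.0 offsets are illustrative only.

# Both SPOOF_COM2* rules hit: cancel SPOOF_COM2OTH's 2.0 so the pair nets 2.3.
meta     LOCAL_SPOOF_COM2_BOTH  (SPOOF_COM2COM && SPOOF_COM2OTH)
describe LOCAL_SPOOF_COM2_BOTH  Both SPOOF_COM2* rules hit, offset the overlap
score    LOCAL_SPOOF_COM2_BOTH  -2.0

# SpamCop and XBL both list the relay: 4.0 + 3.0 - 2.0 = 5.0.
meta     LOCAL_SPAMCOP_AND_XBL  (RCVD_IN_BL_SPAMCOP_NET && RCVD_IN_XBL)
describe LOCAL_SPAMCOP_AND_XBL  Relay listed in both SpamCop and XBL
score    LOCAL_SPAMCOP_AND_XBL  -2.0
```

Running "spamassassin --lint" after editing local.cf should catch syntax
mistakes before the rules go live.
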
However, please do note that, generally, this *is* a strong sign of
spamminess -- stronger than the sum of both individually. Hence there
are stock rules like DIGEST_MULTIPLE...


Also, regarding both examples: The scores have been set by a really
long and thorough process of investigating large ham and spam corpora.
In particular, if two rules are similar in nature and likely to trigger
together, the *sum* of them is what has proven to be most effective
in identifying spam while still maintaining a seriously low FP rate. In
a nutshell: The sum is on purpose.

Granted, this is with the default threshold of 5, not with a custom
required_score of 1...  Seriously.

  guenther


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}