Posted to users@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2011/08/16 00:06:05 UTC

Re: Inconsistent spam scores between spam headers and rewritten subject line.

On Tue, 2011-08-16 at 01:07 +0930, Rodney Baker wrote:
> On Tue, 16 Aug 2011 00:48:13 Bowie Bailey wrote:

> > >    * ^Subject.*SPAM\([0-9]{1,3}\.[0-9]\).*
> > >    $HOME/Maildir/.Spam//
> > > 
> > > I'm attempting to filter on the modified subject line (which for some
> > > reason isn't working - that rule never seems to match and spam never
> > > gets moved into the Spam folder, even though I've tested the regex
> > > manually). I thought of filtering on the X-Spam-Status header instead,
> > > but when I had a look at a message that was marked as Spam (according to
> > > the subject line) I found something rather strange...

Yes, filtering on SA's X-Spam-Status or X-Spam-Level headers is the way
to go. Once you have found and fixed where SA gets called a second time
(actually the first time), these headers will no longer be overwritten
and will be reliable for filtering.
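
A minimal sketch of such a recipe, assuming SA's default header layout
(X-Spam-Status beginning with "Yes" for spam, X-Spam-Level carrying one
asterisk per point) and the Maildir spam folder from the recipe quoted
above. The threshold of 8 in the second recipe is purely illustrative.

```procmail
# Assumes SA's default headers. The folder path is taken from the
# quoted recipe; the trailing slash asks procmail for Maildir delivery.
:0
* ^X-Spam-Status: Yes
$HOME/Maildir/.Spam/

# Alternative: filter by score via X-Spam-Level, here 8 points or more.
# The asterisks are escaped because * is a regex operator.
:0
* ^X-Spam-Level: \*\*\*\*\*\*\*\*
$HOME/Maildir/.Spam/
```

Use one or the other. The first catches everything SA flags as spam, the
second only mail at or above the chosen score.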

Anyway, the reason the above procmail recipe doesn't work is simply that
procmail uses its own, rather limited flavor of regular expressions.
It's not PCRE.

In particular, procmail does not understand {x,y} range quantifiers; it
treats that part as a plain string to match, and that literal string
does not occur in the subject.
(Caveat: From memory, not actually looked it up again for verification.)
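
If you do want to keep matching on the subject, one way around it is to
spell the range out with the ? operator, which procmail does understand.
This is only a sketch: it assumes the rewritten subject contains a tag of
the form SPAM(12.3), as the original regex suggests.

```procmail
# One to three digits written as [0-9][0-9]?[0-9]? instead of [0-9]{1,3}.
# \( \) and \. match the literal parentheses and the decimal point.
:0
* ^Subject:.*SPAM\([0-9][0-9]?[0-9]?\.[0-9]\)
$HOME/Maildir/.Spam/
```

The trailing .* of the original condition is dropped, since procmail
conditions are not anchored at the end anyway.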


> > >     3.8 KB_DATE_CONTAINS_TAB   KB_DATE_CONTAINS_TAB
> > >     3.0 IMPOTENCE              BODY: Impotence cure
> > >    -0.0 BAYES_20               BODY: Bayes spam probability is 5 to 20%
> > >                                [score: 0.1050]
> > >     2.0 KB_FAKED_THE_BAT       KB_FAKED_THE_BAT
> > >     1.2 RDNS_NONE              Delivered to internal network by a host with no
> > >                                rDNS

Oh, yeah, these do ring quite some bells... ;)

After you fixed your mail processing chain to not have SA chew twice on
the spam -- you should manually train Bayes, feeding it a lot of hand
classified spam, and possibly ham. Check your 'sa-learn --dump magic'
numbers. The Bayes score of 0.1 is way out of line.

Note though, that a previous site-wide SA filter might use a site-wide
user, not the one owning the procmail recipe. Thus Bayes scores might
suddenly change once it's run per user. Check the numbers and
performance for the user you'll use after fixing the chain issue.


> > You need to fix whatever is causing the message to be scanned twice.
> 
> OK - that makes sense. Now I'm wondering if there is a global mail config 
> somewhere that is routing the message through SA, and then my local 
> .procmailrc is doing it again. Time to go digging...

Site-wide /etc/procmailrc, SMTP server milter, transport or similar, or
even something like Amavis in the chain?

> That then leaves the question as to why my procmail recipe isn't triggering on 
> the rewritten subject, but that is probably not for this list. 

It's sufficiently related. ;)  See above.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Inconsistent spam scores between spam headers and rewritten subject line.

Posted by Bowie Bailey <Bo...@BUC.com>.
On 8/16/2011 8:55 AM, Rodney Baker wrote:
> On Tue, 16 Aug 2011 07:36:05 Karsten Bräckelmann wrote:
>
>> After you fixed your mail processing chain to not have SA chew twice on
>> the spam -- you should manually train Bayes, feeding it a lot of hand
>> classified spam, and possibly ham. Check your 'sa-learn --dump magic'
>> numbers. The Bayes score of 0.1 is way out of line.
> Agreed. I do run sa-learn --spam (actually now have it scheduled to run weekly 
> on a folder into which I drop all the non-classified spam messages) and --ham 
> (on a folder with messages that were false-positives).


When you are trying to fix a Bayes problem, it can be useful to feed it
as much as possible.  Put *all* your ham and *all* your spam (properly
classified or not) into those folders and let Bayes learn from it.

-- 
Bowie

Re: Inconsistent spam scores between spam headers and rewritten subject line.

Posted by Rodney Baker <ro...@jeremiah31-10.net>.
On Tue, 16 Aug 2011 07:36:05 Karsten Bräckelmann wrote:
> On Tue, 2011-08-16 at 01:07 +0930, Rodney Baker wrote:
> > On Tue, 16 Aug 2011 00:48:13 Bowie Bailey wrote:
> > > >    * ^Subject.*SPAM\([0-9]{1,3}\.[0-9]\).*
> > > >    $HOME/Maildir/.Spam//
> > > > 
> > > > I'm attempting to filter on the modified subject line (which for some
> > > > reason isn't working - that rule never seems to match and spam never
> > > > gets moved into the Spam folder, even though I've tested the regex
> > > > manually). I thought of filtering on the X-Spam-Status header
> > > > instead, but when I had a look at a message that was marked as Spam
> > > > (according to the subject line) I found something rather strange...
> 
> Yes, filtering on SA's X-Spam-Status or X-Spam-Level headers is the way
> to go. Once you have found and fixed where SA gets called a second time
> (actually the first time), these headers will no longer be overwritten
> and will be reliable for filtering.
> 
> Anyway, the reason the above procmail recipe doesn't work is simply that
> procmail uses its own, rather limited flavor of regular expressions.
> It's not PCRE.
> 
> In particular, procmail does not understand {x,y} range quantifiers; it
> treats that part as a plain string to match, and that literal string
> does not occur in the subject.
> (Caveat: From memory, not actually looked it up again for verification.)

Ah, thank you. Despite googling for lots of stuff on procmail I've not been 
able to find a definitive reference for what can and can't be used in a 
procmail recipe. Maybe I just haven't used the right search terms (or maybe I 
just haven't understood what I've read). Anyway, thanks for the clarification.

> 
> > > >     3.8 KB_DATE_CONTAINS_TAB   KB_DATE_CONTAINS_TAB
> > > >     3.0 IMPOTENCE              BODY: Impotence cure
> > > >    -0.0 BAYES_20               BODY: Bayes spam probability is 5 to 20%
> > > >                                [score: 0.1050]
> > > >     2.0 KB_FAKED_THE_BAT       KB_FAKED_THE_BAT
> > > >     1.2 RDNS_NONE              Delivered to internal network by a host with no
> > > >                                rDNS
> 
> Oh, yeah, these do ring quite some bells... ;)
> 
> After you fixed your mail processing chain to not have SA chew twice on
> the spam -- you should manually train Bayes, feeding it a lot of hand
> classified spam, and possibly ham. Check your 'sa-learn --dump magic'
> numbers. The Bayes score of 0.1 is way out of line.

Agreed. I do run sa-learn --spam (actually now have it scheduled to run weekly 
on a folder into which I drop all the non-classified spam messages) and --ham 
(on a folder with messages that were false-positives).
 
> 
> Note though, that a previous site-wide SA filter might use a site-wide
> user, not the one owning the procmail recipe. Thus Bayes scores might
> suddenly change once it's run per user. Check the numbers and
> performance for the user you'll use after fixing the chain issue.
> 
> > > You need to fix whatever is causing the message to be scanned twice.
> > 
> > OK - that makes sense. Now I'm wondering if there is a global mail config
> > somewhere that is routing the message through SA, and then my local
> > .procmailrc is doing it again. Time to go digging...
> 
> Site-wide /etc/procmailrc, SMTP server milter, transport or similar, or
> even something like Amavis in the chain?

There is no /etc/procmailrc and no milter that I'm aware of. I'm running 
fetchmail/sendmail/dovecot. This machine doubles as my home mail server/file 
server and desktop machine. The only reason I'm running IMAP is so that I can 
access the same mail from my laptop or netbook when I need to (and I used to 
run squirrelmail to allow access remotely via https webmail, but not any 
more).
 
> 
> > That then leaves the question as to why my procmail recipe isn't
> > triggering on the rewritten subject, but that is probably not for this
> > list.
> 
> It's sufficiently related. ;)  See above.

Thanks again. :-)

-- 
======================================================
Rodney Baker
rodney@jeremiah31-10.net
web: www.jeremiah31-10.net
======================================================