Posted to users@spamassassin.apache.org by Tim Boyer <ti...@denmantire.com> on 2006/11/01 14:14:39 UTC

Inconsistent scoring

I've been using SA for years.  I'm running 3.1.6 on a Red Hat box, and 99%
of the time, all is well.

Last week I added a rule to tag those annoying .gif pump-and-dump emails.
Nothing fancy:

rawbody IMG_SRC_CID         /src\=(\"c|c)id\:/i
score IMG_SRC_CID       2.0

Most of the time it works fine.  However, occasionally, I'll get an email
that hits ONLY that rule.  I'm using MIMEDefang to rewrite the headers, and
all it shows is

X-Spam-Score: 2 (**) IMG_SRC_CID

But when I run spamassassin --debug < test with the message, it finds all
kinds of fun things:


Content analysis details:   ( 6.6 points, 9.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.1 FORGED_RCVD_HELO       Received: contains a forged HELO
 1.5 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
-0.3 BAYES_40               BODY: Bayesian spam probability is 20 to 40%
                            [score: 0.2631]
 1.9 HTML_IMAGE_ONLY_28     BODY: HTML: images with 2400-2800 bytes of words
 0.0 HTML_MESSAGE           BODY: HTML included in message
 1.4 HTML_10_20             BODY: Message is 10% to 20% HTML
 0.0 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 2.0 IMG_SRC_CID            RAW: cid in body

The very next message is the same kind of scam, but sees everything:

X-Spam-Score: 7.967 (*******)
BAYES_00,DNS_FROM_RFC_ABUSE,FORGED_RCVD_HELO,HTML_00_10,
HTML_MESSAGE,IMG_SRC_CID,MIME_HTML_ONLY,RCVD_NUMERIC_HELO


So what obvious mistake am I making?  Thanks for any help...

--
tim boyer
tim@denmantire.com


RE: Inconsistent scoring

Posted by Tim Boyer <ti...@denmantire.com>.
> 
> This seems rather odd.  I suppose you did lint your rules to 
> make sure that you don't have a problem somewhere?  It is 
> known that SA can do things like dropping most of the rules 
> file following a rule with an error in it.
> 

Yup; no lint problems at all.

> Maybe you are using amavisd-new or one of the other tools that 
> does its own header rewriting in at least some cases?
> 

MIMEDefang, but I can't see it doing this.

> I do have a suggestion for improving your rule though.  There 
> are several things that aren't as efficient as they should 
> be.  Instead of
> 
> > rawbody IMG_SRC_CID         /src\=(\"c|c)id\:/i
> 
> do
> 
> > rawbody IMG_SRC_CID         /src="?cid:/i
> 

Thanks much - I need all the perl help I can get. :)

-- tim --


RE: Inconsistent scoring

Posted by "John D. Hardin" <jh...@impsec.org>.
On Wed, 1 Nov 2006, Mark wrote:

> > > rawbody IMG_SRC_CID         /src\s*=\s*"?cid:/i
> 
> Well, that matches newlines, too (really, even without /m). So, you want:
> 
> rawbody IMG_SRC_CID         /src[ \t]*=[ \t]*"?cid:/i

Why? Newlines there are syntactically valid, are they not?

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174    pgpk -a jhardin@impsec.org
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  If someone has a gun and is trying to kill you, it would be
  reasonable to shoot back with your own gun.
                                      -- the Dalai Lama, May 15, 2001
-----------------------------------------------------------------------
 6 days until the campaign ads stop


RE: Inconsistent scoring

Posted by Mark <ad...@asarian-host.net>.
> -----Original Message-----
> From: Loren Wilton [mailto:lwilton@earthlink.net] 
> Sent: Wednesday, November 1, 2006 15:11
> To: users@spamassassin.apache.org
> Subject: Re: Inconsistent scoring
> 
> 
> Also, while I've never seen it done, I think it is 
> theoretically possible to have spaces on either side
> of the equal sign. So the regex really should 
> probably be:
> 
> > rawbody IMG_SRC_CID         /src\s*=\s*"?cid:/i

Well, that matches newlines, too (really, even without /m). So, you want:

rawbody IMG_SRC_CID         /src[ \t]*=[ \t]*"?cid:/i

And if we're really nitpicky, we want to match "src" on a boundary:

rawbody IMG_SRC_CID         /\bsrc[ \t]*=[ \t]*"?cid:/i
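
A minimal sketch of the difference between these three variants, using
Python's re module as a stand-in for Perl (the constructs used here behave
the same way in both).  The sample strings are hypothetical, not taken from
real messages:

```python
import re

# The three patterns discussed above, copied from the thread.
with_s = re.compile(r'src\s*=\s*"?cid:', re.I)
with_blank = re.compile(r'src[ \t]*=[ \t]*"?cid:', re.I)
with_boundary = re.compile(r'\bsrc[ \t]*=[ \t]*"?cid:', re.I)

# Hypothetical inputs for illustration only.
wrapped = 'src =\n"cid:part1@example.com"'      # newline after the equal sign
spaced = 'src = "cid:part1@example.com"'        # spaces only
embedded = 'datasrc="cid:part1@example.com"'    # "src" inside a longer word

# \s includes newlines, so the \s* form matches across the line break.
print(bool(with_s.search(wrapped)))         # True
# [ \t] allows only spaces and tabs, so the wrapped form is not matched.
print(bool(with_blank.search(wrapped)))     # False
# Both forms accept plain spaces around the equal sign.
print(bool(with_blank.search(spaced)))      # True
# Without \b, "src" is found inside "datasrc"; with \b it is not.
print(bool(with_blank.search(embedded)))    # True
print(bool(with_boundary.search(embedded))) # False
```

Which behaviour is wanted depends on whether a line break around the equal
sign should count as a hit.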

- Mark


Re: Inconsistent scoring

Posted by Loren Wilton <lw...@earthlink.net>.
This seems rather odd.  I suppose you did lint your rules to make sure that 
you don't have a problem somewhere?  It is known that SA can do things like 
dropping most of the rules file following a rule with an error in it.

Maybe you are using amavisd-new or one of the other tools that does its own 
header rewriting in at least some cases?

I do have a suggestion for improving your rule though.  There are several 
things that aren't as efficient as they should be.  Instead of

> rawbody IMG_SRC_CID         /src\=(\"c|c)id\:/i

do

> rawbody IMG_SRC_CID         /src="?cid:/i

You don't need the alternation in there; all you really want is an optional 
quote mark, and following the quote with a question mark does that.  Even if 
you needed an alternation, it would be better to use a "non-capturing" form 
of grouping: (?:blah) rather than just (blah).  This saves perl the overhead 
of storing the string that matches inside the parentheses in case you want 
to use it later in the regex for some reason.
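
As a rough check of the point above, here is a small sketch that compares
the original and simplified patterns.  It uses Python's re module as a
stand-in for Perl, and the sample strings are hypothetical:

```python
import re

# Patterns copied from the thread.
original = re.compile(r'src\=(\"c|c)id\:', re.I)
simplified = re.compile(r'src="?cid:', re.I)
# Non-capturing form, for cases where an alternation really is needed.
non_capturing = re.compile(r'src=(?:"c|c)id:', re.I)

# Hypothetical sample lines for illustration only.
samples = [
    '<img src="cid:part1.0123@example.com">',
    '<IMG SRC=cid:part1.0123@example.com>',
    '<img src="http://example.com/a.gif">',
]

def hits(pattern):
    """Return one True/False per sample line."""
    return [bool(pattern.search(s)) for s in samples]

print(hits(original))    # [True, True, False]
print(hits(simplified))  # [True, True, False]

# Number of capture groups each pattern has to track.
print(original.groups, non_capturing.groups)  # 1 0
```

On these samples both patterns hit the same lines, and the (?:...) form
carries no capture group.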

Also, while I've never seen it done, I think it is theoretically possible to 
have spaces on either side of the equal sign.  So the regex really should 
probably be:

> rawbody IMG_SRC_CID         /src\s*=\s*"?cid:/i


        Loren



Re: Inconsistent scoring

Posted by Theo Van Dinter <fe...@apache.org>.
On Wed, Nov 01, 2006 at 08:14:39AM -0500, Tim Boyer wrote:
> Last week I added a rule to tag those annoying .gif pump-and-dump emails.
> Nothing fancy:
> rawbody IMG_SRC_CID         /src\=(\"c|c)id\:/i

There are several issues with this rule IMO, but there's already a very
similar rule available via sa-update:

 16.856  20.0630   0.3170    0.984   0.77    1.00  __TVD_INT_CID

which shows that it hits a lot of ham (0.32%), but also hits 20% of spam.
It's good enough for a meta dependency, but not necessarily as a rule by
itself, though YMMV.
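
For readers unfamiliar with that line, a small sketch of how the 0.984
figure relates to the two hit rates.  It assumes the columns are overall %,
spam %, ham %, S/O, rank, score and rule name, and that S/O is the spam hit
rate divided by the sum of the spam and ham hit rates; the message above
confirms only the spam and ham percentages:

```python
# Figures copied from the quoted line; the column meanings are an assumption.
spam_pct = 20.0630   # percent of spam messages hit
ham_pct = 0.3170     # percent of ham messages hit

def spam_over_overall(spam_pct, ham_pct):
    """Share of the rule's hit rate that comes from spam (assumed S/O formula)."""
    return spam_pct / (spam_pct + ham_pct)

so = spam_over_overall(spam_pct, ham_pct)
print(round(so, 3))  # 0.984
```

Under that assumption the result agrees with the 0.984 in the fourth
column.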

-- 
Randomly Selected Tagline:
"It is sometimes fun to scare people... Especially Matt." - Michelle