Posted to dev@spamassassin.apache.org by Jesse Houwing <j....@rulesemporium.com> on 2004/07/22 00:09:14 UTC

SA 2.63 -> 3.0 causes degraded rule efficiency.

Hmz.... I'm having a lot of problems with a couple of rules that seem to
have completely broken after upgrading to SpamAssassin 3.0.

In 2.63, SARE_URI_EQUALS got 191 spam / 0 ham.
Now the same mass-check yields only 10 hits on the same corpus.

I just don't understand what's going on....

This is the rule in question:

uri       SARE_URI_EQUALS          m{^(?:(?:h|%[46]8)(?:t|%[57]4){2}(?:p|%[57]0)(?:s|%[57]3)?(?::|%3a)?(?:%5c|\\|%2f|/){0,2})[^/\?;]+=(?!(?:..)?$).*}i
describe  SARE_URI_EQUALS          Trying to hide the real URL with IE parsing bug
score     SARE_URI_EQUALS          2.5

Any ideas why this is happening?

It probably has something to do with the "improved" way of parsing the
body for URIs, but somehow it isn't improving anything.

I'm seeing this with other rules as well. For example, an
ILLEGAL_COLOR rule I've been working on always caught up to 1700 hits in
my corpus; now I'm getting only 8. They're all valid hits, but it
isn't pleasing, so to speak. I'd hate to have to set these rules to full
to see if they do better.... The illegal color rule is darn ugly at the
moment, but here it is (should probably be converted to an eval test,
but this was quicker to see if it would work):

rawbody   SARE_ILLEGAL_COLOR       /color\s{0,10}(?::|=(?:3d)?(?!3d))(?:[\s\'\"]){0,10}(?![\s\'\">])(?!$|&quot;|\#?(?!\#)(?:[a-f0-9]{3}(?:\W|$)|[a-f0-9]{6}0?(?:\W|$))|rgb\(\s{0,10}(?:25[0-5]|2[0-4][0-9]|1?[0-9]?[0-9])\s{0,10},\s{0,10}(?:25[0-5]|2[0-4][0-9]|1?[0-9]?[0-9])\s{0,10},\s{0,10}(?:25[0-5]|2[0-4][0-9]|1?[0-9]?[0-9]\s{0,10})\)|rgb\(\s{0,10}1?[0-9]?[0-9]%\s{0,10},\s{0,10}1?[0-9]?[0-9]%\s{0,10},\s{0,10}1?[0-9]?[0-9]%\)|transparent|Black|White|Red|Yellow|Lime|Aqua|Blue|Fuchsia|Gr[ae]y|Silver|Maroon|Olive|Green|Teal|Navy|Purple|AliceBlue|AliceBlue|AntiqueWhite|Aqua|Aquamarine|Azure|Beige|Bisque|Black|BlanchedAlmond|Blue|BlueViolet|Brown|BurlyWood|CadetBlue|Chartreuse|Chocolate|Coral|CornflowerBlue|Cornsilk|Crimson|Cyan|DarkBlue|DarkCyan|DarkGoldenrod|DarkGr[ea]y|DarkGreen|DarkKhaki|DarkMagenta|DarkOliveGreen|DarkOrange|DarkOrchid|DarkRed|DarkSalmon|DarkSeaGreen|DarkSlateBlue|DarkSlateGray|DarkTurquoise|DarkViolet|DeepPink|DeepSkyBlue|DimGray|DodgerBlue|FireBrick|FloralWhite|ForestGreen|Fuchsia|Gainsboro|GhostWhite|Gold|Goldenrod|Gr[ae]y|Green|GreenYellow|Honeydew|HotPink|IndianRed|Indigo|Ivory|Khaki|Lavender|LavenderBlush|LawnGreen|LemonChiffon|LightBlue|LightCoral|LightCyan|LightGoldenrodYellow|LightGreen|LightGrey|LightPink|LightSalmon|LightSeaGreen|LightSkyBlue|LightSlateGray|LightSteelBlue|LightYellow|Lime|LimeGreen|Linen|Magenta|Maroon|MediumAquamarine|MediumBlue|MediumOrchid|MediumPurple|MediumSeaGreen|MediumSlateBlue|MediumSpringGreen|MediumTurquoise|MediumVioletRed|MidnightBlue|MintCream|MistyRose|Moccasin|NavajoWhite|Navy|OldLace|Olive|OliveDrab|Orange|OrangeRed|Orchid|PaleGoldenrod|PaleGreen|PaleTurquoise|PaleVioletRed|PapayaWhip|PeachPuff|Peru|Pink|Plum|PowderBlue|Purple|Red|RosyBrown|RoyalBlue|SaddleBrown|Salmon|SandyBrown|SeaGreen|Seashell|Sienna|Silver|SkyBlue|SlateBlue|SlateGray|Snow|SpringGreen|SteelBlue|Tan|Teal|Thistle|Tomato|Turquoise|Violet|Wheat|White|WhiteSmoke|Yellow|YellowGreen|ActiveBorder|ActiveCaption|AppWorkspace|Background|Buttonface|ButtonHighlight|ButtonShadow|ButtonText|CaptionText|GrayText|Highlight|HighlightText|InfoText|Menu|MenuText|Scrollbar|ThreeDDarkShadow|ThreeDFace|ThreeDHighlight|ThreeDLightShadow|ThreeDShadow|Window(?:Frame|WindowText)?).{1,10}/i
score     SARE_ILLEGAL_COLOR       1.666
describe  SARE_ILLEGAL_COLOR       Uses illegal color code

for catching things like:

color: #ff$f%f;
color="cooking"
color="RNDCLR[]" etc...
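
As a rough illustration of the eval-test idea mentioned above, here is a
simplified Python sketch. It is not SpamAssassin code: the function names,
the short list of colour names and the value-extraction regex are all
assumptions made for this example, and it ignores the quoted-printable
residue the rawbody rule has to cope with.

```python
import re

# Deliberately short list (an assumption); the rule above enumerates far more.
KNOWN_NAMES = {"black", "white", "red", "yellow", "lime", "aqua", "blue",
               "fuchsia", "gray", "grey", "silver", "maroon", "olive",
               "green", "teal", "navy", "purple", "transparent"}

# Value after "color:" or "color=", optionally quoted, up to a quote, ';' or '>'.
COLOR_VALUE = re.compile(r"""color\s*[:=]\s*["']?\s*([^"';>]*)""", re.I)
HEX = re.compile(r"#?(?:[0-9a-f]{3}|[0-9a-f]{6})$", re.I)
RGB = re.compile(r"rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$", re.I)

def is_legal_color(value):
    """True for 3- or 6-digit hex, rgb(r, g, b) with 0-255 values, or a known name."""
    value = value.strip()
    if value.lower() in KNOWN_NAMES or HEX.match(value):
        return True
    m = RGB.match(value)
    return bool(m) and all(int(n) <= 255 for n in m.groups())

def illegal_colors(text):
    """Return every non-empty colour value in text that fails is_legal_color."""
    values = (v.strip() for v in COLOR_VALUE.findall(text))
    return [v for v in values if v and not is_legal_color(v)]

for sample in ('color: #ff$f%f;', 'color="cooking"', 'color="RNDCLR[]"'):
    print(sample, "->", illegal_colors(sample))
```

Checking a parsed value against a whitelist keeps the logic readable, at the
cost of needing something to extract the value first.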

As I said before, any help or hints on what's going on would be really
appreciated.

Jesse
SARE Ninja

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Daniel Quinlan <qu...@pathname.com>.

[ I replied to spamassassin-users only, this would be fine for -dev, but
  there's no reason to make people read the same message twice unless
  it's something like an announcement. ]

Jesse Houwing <j....@rulesemporium.com> writes:

> Hmz.... I'm having a lot of problems with a couple of rules that seem to
> have completely broken after upgrading to spamassassin 3.0.

I think we need smaller and easier to digest test rules (and maybe an
example message) for the rawbody and uri issues.  :-)

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Loren Wilton <lw...@earthlink.net>.

A spam to trigger the test.

        Loren

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Daniel Quinlan <qu...@pathname.com>.

"Jesse Houwing" <to...@chello.nl> writes:

> I did a quick grep through my corpus, but it turned out that there
> actually are just 10 such URLs in there. The other hits were on
> messages that had the . encoded as =2e. But I'm afraid that to catch
> those I'd have to make this rule full (yuk!).

Yeah, I think for 3.1 (in our copious spare development time), we should
probably clean up the rule types a bit.  This has been discussed before,
but the basics would be:

  - undecoded type ("undecoded", new type)
  - decoded type ("rawbody", probably rename to "decoded" for clarity,
    but support old name for backwards compatibility)
  - rendered type ("body", keep old name)
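
To make the three proposed views concrete, here is a small Python sketch
(not SpamAssassin code). The sample part body is hypothetical, and the tag
stripper is a crude stand-in for whatever HTML rendering the "body" type
really performs.

```python
import quopri
import re

# Hypothetical MIME part body exactly as transmitted (quoted-printable).
undecoded = b'<a href=3D"http://www.example.com/">Click=20here</a>'

# "decoded" view: transfer encoding removed, markup still present.
decoded = quopri.decodestring(undecoded).decode("ascii")

# "rendered" view: markup stripped as well (naive tag removal, illustration only).
rendered = re.sub(r"<[^>]+>", "", decoded)

print(undecoded)
print(decoded)
print(rendered)
```

A rule looking for "=3D" can only hit the first view, one looking for the
href can hit the second, and one looking for visible text fits the third.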

"full" should probably just remain as the entire pristine message.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Jesse Houwing <to...@chello.nl>.

-----Original Message-----
From: Theo Van Dinter <fe...@kluge.net>
To: Jesse Houwing <j....@rulesemporium.com>
Cc: spamassassin-dev@incubator.apache.org
Date: Thu, 22 Jul 2004 03:23:08 -0400
Subject: Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

> On Thu, Jul 22, 2004 at 08:07:34AM +0200, Jesse Houwing wrote:
> > it is abused quite often in spam. Any chars before the = sign are
> > discarded and the hostname after the = is used instead, but to the user
> > the host before the = is shown (nifty).
> 
> Heh.  Neat.  IE++  <G>

Cute, isn't it? ;)

> > But it seems to do it too harshly, I'll try to find an example from my
> > corpus that should be tagged, but isn't in this case.
> 
> Ok, I'd appreciate that.  Right now, I tried:
> 
> http://penistone.opoloveok=com/3/

I did a quick grep through my corpus, but it turned out that there actually 
are just 10 such URLs in there. The other hits were on messages that had the 
. encoded as =2e. But I'm afraid that to catch those I'd have to make this 
rule full (yuk!).

> and that has the rule hit in both 2.6 and 3.0.  If I encode in QP and
> change = to =3D, and also tried a base64 encoding, those also let both
> versions' rules hit.  I did a quick look around in my corpus for a spam
> with an appropriate URL, but didn't see one.

I seem to have only 10, but I've had a lot of people who asked for a few 
updates/fixes telling me they had lots of hits, so I'll keep the rule in 
its current form.

Jesse

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Theo Van Dinter <fe...@kluge.net>.

On Thu, Jul 22, 2004 at 08:07:34AM +0200, Jesse Houwing wrote:
> it is abused quite often in spam. Any chars before the = sign are 
> discarded and the hostname after the = is used instead, but to the user 
> the host before the = is shown (nifty).

Heh.  Neat.  IE++  <G>

> But it seems to do it too harshly, I'll try to find an example from my 
> corpus that should be tagged, but isn't in this case.

Ok, I'd appreciate that.  Right now, I tried:

http://penistone.opoloveok=com/3/

and that has the rule hit in both 2.6 and 3.0.  If I encode in QP and
change = to =3D, and also tried a base64 encoding, those also let both
versions' rules hit.  I did a quick look around in my corpus for a spam
with an appropriate URL, but didn't see one.

-- 
Randomly Generated Tagline:
Hey, if pi == 3, and three == 0, does that make pi == 0?  :-)
              -- Larry Wall in <19...@wall.org>

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Jesse Houwing <j....@rulesemporium.com>.

Theo Van Dinter wrote:

>On Thu, Jul 22, 2004 at 12:09:14AM +0200, Jesse Houwing wrote:
>
>>This is the rule in question:
>>
>>uri       SARE_URI_EQUALS          m{^(?:(?:h|%[46]8)(?:t|%[57]4){2}(?:p|%[57]0)(?:s|%[57]3)?(?::|%3a)?(?:%5c|\\|%2f|/){0,2})[^/\?;]+=(?!(?:..)?$).*}i
>
>Hrm.  I have no idea what this is actually trying to
>match.  The first (?: bit isn't necessary, btw.  Looks like a
>URL with a = somewhere in the host section?  ie: something like
>'http://penistone=2eopoloveok=2ecom/3/' in a quoted-printable part?
>(this is the only set of matches I could find with your RE)
>
No, it looks for any URI with a = in the hostname (and excludes the 
quoted-printable =), so:

http://www.iamahost=butthisismyrealname.com/ would match
http://www.butthisismyreal= would not,
neither would http://www.butthisismyreal=20
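
The three cases above can be checked with a short Python sketch. The
pattern is transcribed from the SARE_URI_EQUALS rule as posted (the
trailing /i becomes re.IGNORECASE); the helper name hits() is made up for
this example, and Python's regex engine is assumed to behave like Perl's
for this particular pattern.

```python
import re

# SARE_URI_EQUALS pattern, transcribed from the rule in this thread.
SARE_URI_EQUALS = re.compile(
    r"^(?:(?:h|%[46]8)(?:t|%[57]4){2}(?:p|%[57]0)(?:s|%[57]3)?(?::|%3a)?"
    r"(?:%5c|\\|%2f|/){0,2})[^/\?;]+=(?!(?:..)?$).*",
    re.IGNORECASE,
)

def hits(uri):
    """True if the rule's regex matches the given URI string."""
    return SARE_URI_EQUALS.search(uri) is not None

for uri in ("http://www.iamahost=butthisismyrealname.com/",
            "http://www.butthisismyreal=",
            "http://www.butthisismyreal=20"):
    print(uri, "->", "match" if hits(uri) else "no match")
```

The negative lookahead (?!(?:..)?$) is what rejects a bare trailing = and a
trailing =XX, which is how the rule avoids quoted-printable leftovers at
the end of a URI.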

This is an Internet Explorer parsing bug I'm trying to detect here, and 
it is abused quite often in spam. Any chars before the = sign are 
discarded and the hostname after the = is used instead, but to the user 
the host before the = is shown (nifty).

>If not, please post an example and I'll be happy to help debug.
>(I don't think this is a 3.0 bug though.  See below.)
>
>If so, however: yeah, that'll be different.  In 2.6:
>
>http://penistone=2eopoloveok=2ecom/3/
>
>vs 3.0:
>
>http://penistone.opoloveok.com/3/
>
>which is caused by 2.6 doing a very half-assed attempt at decoding the
>quoted-printable part, so you get the QP bits in the URI.  3.0 does the
>decoding properly (thanks total MIME parser rewrite!), so you end up
>with the URI you're supposed to get, properly decoded.
>
>Specifically, in PerMsgStatus::get_decoded_body_text_array(), which 2.6x
>uses to get the uri list from, the un-quoted-printable code is:
>
>    s/\=([0-9A-F]{2})/chr(hex($1))/ge;
>
>which clearly has one flaw: it's looking for case-sensitive A-F!  D'oh!
>Therefore, it doesn't match the URI above (uses lowercase).  3.0 does
>the right thing here. :)
>
But it seems to do it too harshly, I'll try to find an example from my 
corpus that should be tagged, but isn't in this case.

Jesse

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Loren Wilton <lw...@earthlink.net>.

> I assume we did our standard MUA tests and found lowercase is decoded.
> A quick check now shows it is decoded on:
>
> Mutt
> Exchange (webmail)
> Apple Mail
>
> Since Exchange does it, I assume OE and O will do it too.  I don't have
> access to a box with those installed right now though.

I do and it does.  I also remember the initial discussions on this change,
and at the time quite a number of clients were found that would decode qp
regardless of case.

> > No more than that.
>
> I believe that is correct, yes.  And that's as far as we go in 3.0
> (if I replace the = with =3D):
>
> http://penistone=2eopoloveok=2ecom/3/

This bothers me.  As best I recall reading the discussions, it turned out
that a number of clients would recursively resolve qp until it couldn't be
done anymore, then use the result.  Perhaps though I'm misremembering and it
was only browsers that would do this and not MUAs.


In any case, I believe qp is relatively immaterial in the case at hand.  The
original test was to catch Asian spammer domains that put an equal sign in
the middle of the domain name.  This is something that comes and goes as
sort of a fad.

        Loren

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Theo Van Dinter <fe...@kluge.net>.

On Wed, Jul 21, 2004 at 06:40:47PM -0700, Dan Quinlan wrote:
> I think this is exactly what we do right now, so no problem.  We might
> want to double check some common MUAs to make sure they do actually decode
> lowercase hex, though.

Yeah, and I'm pretty sure we did.  I didn't know QP required uppercase,
thanks for the reference.  As for what we do, 3.0 only decodes parts
that specify their encoding.  2.6 wasn't quite as precise (for instance,
it didn't deal with messages that had both base64 and qp parts...)

BTW, the lowercase thing was added:

r6615 | jm | 2004-02-11 02:17:40 -0500 (Wed, 11 Feb 2004) | 1 line
QP decoding should deal with lowercase QP as well

I assume we did our standard MUA tests and found lowercase is decoded.
A quick check now shows it is decoded on:

Mutt
Exchange (webmail)
Apple Mail

Since Exchange does it, I assume OE and O will do it too.  I don't have
access to a box with those installed right now though.

> One thing I think you said did give me pause.  I'm pretty sure that:
> 
>   (in a quoted-printable part)
>   http://penistone=3D2eopoloveok=3D2ecom/3/
> 
> should only decode to
> 
>   http://penistone=2eopoloveok=2ecom/3/
> 
> No more than that.

I believe that is correct, yes.  And that's as far as we go in 3.0
(if I replace the = with =3D):

http://penistone=2eopoloveok=2ecom/3/

-- 
Randomly Generated Tagline:
"the curls in your keyboard cord are losing electricity."
         - Today's BOFH Excuse

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Daniel Quinlan <qu...@pathname.com>.

Theo Van Dinter <fe...@kluge.net> writes:

> Specifically, in PerMsgStatus::get_decoded_body_text_array(), which 2.6x
> uses to get the uri list from, the un-quoted-printable code is:
> 
>     s/\=([0-9A-F]{2})/chr(hex($1))/ge;

Quoted-printable does not allow lowercase letters, so [0-9A-F] is
technically correct (RFC 2045).  However, decoding lowercase seems to be
permitted if the data is definitively quoted-printable (that is, QP is the
Content-Transfer-Encoding or there is a "?Q?"  specifier on the header).

I think this is exactly what we do right now, so no problem.  We might
want to double check some common MUAs to make sure they do actually decode
lowercase hex, though.

One thing I think you said did give me pause.  I'm pretty sure that:

  (in a quoted-printable part)
  http://penistone=3D2eopoloveok=3D2ecom/3/

should only decode to

  http://penistone=2eopoloveok=2ecom/3/

No more than that.
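
A minimal Python sketch of that single-pass behaviour follows. The helper
qp_decode_once() is hypothetical, handles only =XX escapes (no soft line
breaks), and is not the SpamAssassin decoder.

```python
import re

def qp_decode_once(text):
    """One pass of =XX decoding; text produced by a substitution is not rescanned."""
    return re.sub(r"=([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

once = qp_decode_once("http://penistone=3D2eopoloveok=3D2ecom/3/")
print(once)  # http://penistone=2eopoloveok=2ecom/3/
```

Each =3D becomes a literal =, and the "2e" that follows stays as plain
text because the substitution never looks back at what it just produced.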

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: SA 2.63 -> 3.0 causes degraded rule efficiency.

Posted by Theo Van Dinter <fe...@kluge.net>.

On Thu, Jul 22, 2004 at 12:09:14AM +0200, Jesse Houwing wrote:
> This is the rule in question:
> 
> uri       SARE_URI_EQUALS
> m{^(?:(?:h|%[46]8)(?:t|%[57]4){2}(?:p|%[57]0)(?:s|%[57]3)?(?::|%3a)?(?:%5c|\\|%2f|/){0,2})[^/\?;]+=(?!(?:..)?$).*}i

Hrm.  I have no idea what this is actually trying to
match.  The first (?: bit isn't necessary, btw.  Looks like a
URL with a = somewhere in the host section?  ie: something like
'http://penistone=2eopoloveok=2ecom/3/' in a quoted-printable part?
(this is the only set of matches I could find with your RE)

If not, please post an example and I'll be happy to help debug.
(I don't think this is a 3.0 bug though.  See below.)

If so, however: yeah, that'll be different.  In 2.6:

http://penistone=2eopoloveok=2ecom/3/

vs 3.0:

http://penistone.opoloveok.com/3/

which is caused by 2.6 doing a very half-assed attempt at decoding the
quoted-printable part, so you get the QP bits in the URI.  3.0 does the
decoding properly (thanks total MIME parser rewrite!), so you end up
with the URI you're supposed to get, properly decoded.

Specifically, in PerMsgStatus::get_decoded_body_text_array(), which 2.6x
uses to get the uri list from, the un-quoted-printable code is:

    s/\=([0-9A-F]{2})/chr(hex($1))/ge;

which clearly has one flaw: it's looking for case-sensitive A-F!  D'oh!
Therefore, it doesn't match the URI above (uses lowercase).  3.0 does
the right thing here. :)
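
The effect of the case-sensitive character class can be reproduced with a
Python equivalent of that substitution. This is a sketch only: qp_decode()
and its ignore_case switch are invented for the comparison and do not
correspond to any SpamAssassin function.

```python
import re

def qp_decode(text, ignore_case):
    """Python analogue of s/\\=([0-9A-F]{2})/chr(hex($1))/ge, optionally case-insensitive."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(r"=([0-9A-F]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  text, flags=flags)

uri = "http://penistone=2eopoloveok=2ecom/3/"
strict = qp_decode(uri, ignore_case=False)   # 2.6-style: lowercase hex left alone
lenient = qp_decode(uri, ignore_case=True)   # 3.0-style: lowercase hex decoded

print(strict)   # http://penistone=2eopoloveok=2ecom/3/
print(lenient)  # http://penistone.opoloveok.com/3/
```

With the strict form the = survives into the URI, so a rule keyed on = in
the hostname fires; with the lenient form it is gone.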

-- 
Randomly Generated Tagline:
Historically Tcl has always stored all intermediate results as strings.
 (With 8.0 they're rethinking that.  Of course, Perl rethought that from
 the start.)
              -- Larry Wall in <19...@wall.org>