Posted to users@spamassassin.apache.org by "J." <sw...@yahoo.com> on 2007/04/07 00:19:09 UTC

Rule debugging

I got a false positive that was triggered by this:

body    MY_VIAG     /\b(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)/i
score   MY_VIAG     5

But when I try to see what it matched using this:

grep -i '(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)' /home/domainmail/debug/ham.eml

I get no output. Is there a better way to find what matched? Thanks.

-Jason


 

Re: Rule debugging

Posted by "J." <sw...@yahoo.com>.
--- guenther <gu...@rudersport.de> wrote:

> Please do not hijack other threads by replying to a mail, if you
> actually mean to start an unrelated thread. Removing the quoted text is
> not sufficient.
> 
> On Fri, 2007-04-06 at 15:19 -0700, J. wrote:
> > I got a false positive that was triggered by this:
> > 
> > body    MY_VIAG    /\b(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)/i
> > score   MY_VIAG    5
> 
> I don't think you want it like that. ;)  See below.
> 
> > But when I try to see what it matched using this:
> > 
> > grep -i '(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)' /home/domainmail/debug/ham.eml
> > 
> > I get no output. Is there a better way to find what matched? Thanks.
> 
> Since the RE is rather simple and straight forward, looking at the mail
> in your client, reading the text, should do.
> 
> 
> vi ain't a grand editor.
> 
> Oops, that matches your RE. :)  You probably should have another look at
> it, trying to imagine what it possibly can match that you don't intend
> to flag as spam. With all these .{0,4} in it, there are a *lot* of
> possibilities.
> 
> Also, IMHO it is a bad idea to score 5.0 points for any rule that is not
> fail proof. Generally, it should be considered a no-no. I do use it for
> one handcrafted rule though, that matches a sender domain that
> definitely must be faked -- because I happen to own the domain. In any
> case where you can not eliminate any FP you should *not* use a score
> that high. You won't find any single stock SA rule either, that alone
> classifies a mail as spam. Carefully assigned scores based on the
> "spammyness" and accumulating scores of multiple hit tests is a basic
> concept of SA.

Thanks Guenther. I didn't realize that thread info would be stored in
the reply even if I changed the subject. I guess I'll add the list
address to my book or something.

The problem I was having that made me need to create my own rule was
that there were a bunch of false negatives one day that were evading
all the sa rules including the ones meant to catch stuff like the word
in question (even when obfuscated).

Here is the work file I was using to help create the rule I ended up
using. As you can see from some of the lines the {0,4} was necessary or
the rule would miss some of them:

http://binaryops.com/spamwork.txt

I guess I'll lower the score to 2 or something and see what happens.
Most of the spam that gets hit by my rule also gets hit by
DRUGS_ERECTILE but like I said, a bunch slipped through one day and I
was determined to stop them.


 

Re: Rule debugging

Posted by guenther <gu...@rudersport.de>.
Please do not hijack other threads by replying to a mail, if you
actually mean to start an unrelated thread. Removing the quoted text is
not sufficient.

On Fri, 2007-04-06 at 15:19 -0700, J. wrote:
> I got a false positive that was triggered by this:
> 
> body    MY_VIAG    /\b(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)/i
> score   MY_VIAG    5

I don't think you want it like that. ;)  See below.

> But when I try to see what it matched using this:
> 
> grep -i '(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)' /home/domainmail/debug/ham.eml
> 
> I get no output. Is there a better way to find what matched? Thanks.

Since the RE is rather simple and straightforward, looking at the mail
in your client and reading the text should do.


vi ain't a grand editor.

Oops, that matches your RE. :)  You probably should have another look at
it, trying to imagine what it possibly can match that you don't intend
to flag as spam. With all these .{0,4} in it, there are a *lot* of
possibilities.
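
The claim above can be checked outside SpamAssassin. What follows is a minimal sketch in Python, on the assumption that Python's `re` module treats the constructs used in this particular pattern the same way Perl does (that is not true of Perl regexes in general). The helper name `matched_text` is made up for the illustration.

```python
import re

# The pattern from the MY_VIAG rule, copied as posted.
# The trailing /i of the rule becomes re.IGNORECASE here.
MY_VIAG = re.compile(
    r"\b(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)",
    re.IGNORECASE,
)


def matched_text(text):
    """Return the substring the pattern matched, or None if nothing matched."""
    m = MY_VIAG.search(text)
    return m.group(0) if m else None


print(matched_text("vi ain't a grand editor."))  # -> vi ain't a gra
print(matched_text("nothing spammy in here"))    # -> None
```

The second alternative is the one that fires: each `.{0,4}` is free to swallow spaces, apostrophes and ordinary letters, so unrelated words line up into a hit.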

Also, IMHO it is a bad idea to score 5.0 points for any rule that is not
fail-proof. Generally, it should be considered a no-no. I do use it for
one handcrafted rule though, which matches a sender domain that
definitely must be faked -- because I happen to own the domain. In any
case where you cannot rule out FPs, you should *not* use a score
that high. You won't find a single stock SA rule, either, that alone
classifies a mail as spam. Carefully assigning scores based on
"spamminess", and accumulating the scores of all the tests that hit, is a
basic concept of SA.
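
The accumulation idea can be sketched in a few lines of Python. This is an illustration only: the rule names other than MY_VIAG and all of the scores are hypothetical, and the threshold of 5.0 is assumed to be the default `required_score`.

```python
# Hypothetical scores for illustration; these are not stock SA values.
SCORES = {"MY_VIAG": 5.0, "OTHER_RULE_A": 1.5, "OTHER_RULE_B": 2.0}
REQUIRED = 5.0  # assumed threshold (SA's default required_score)


def is_spam(hits, scores=SCORES, required=REQUIRED):
    """Sum the scores of the rules that hit and compare with the threshold."""
    total = sum(scores[name] for name in hits)
    return total >= required


# With a score of 5.0, one false positive on MY_VIAG is enough on its own.
print(is_spam(["MY_VIAG"]))

# With the score lowered to 2.0, the same single hit is not enough,
# but it still contributes when other rules hit as well.
lowered = dict(SCORES, MY_VIAG=2.0)
print(is_spam(["MY_VIAG"], lowered))
print(is_spam(["MY_VIAG", "OTHER_RULE_A", "OTHER_RULE_B"], lowered))
```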

  guenther


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Rule debugging

Posted by "J." <sw...@yahoo.com>.
--- Theo Van Dinter <fe...@apache.org> wrote:

> On Fri, Apr 06, 2007 at 03:19:09PM -0700, J. wrote:
> > body    MY_VIAG    /\b(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)/i
> 
> just because it makes me cringe:
>   - use [il] for character classes, not (i|l)
>   - don't use (...) if you don't need capturing, use (?:...)
> 
> :)
> 
> > grep -i '(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)' /home/domainmail/debug/ham.eml
> > 
> > I get no output. Is there a better way to find what matched? Thanks.
> 
> grep doesn't do perl regular expressions (though typically a gnu grep
> compiled w/ pcre will have an option).  you may want to use pcregrep.

Thanks, not sure I understand the "?:" comment, but here's the result
of my taking your suggestions:

body    MY_VIAG     /\b(?:v[il]a.{0,4}g.{0,4}r.{0,4}a)|(?:v.{0,4}[il].{0,4}a.{0,4}gra)/i
score   MY_VIAG     5

It still hits this rule when I run sa from the command line, but when I
run this:

pcregrep -i '(?:v[il]a.{0,4}g.{0,4}r.{0,4}a)|(?:v.{0,4}[il].{0,4}a.{0,4}gra)' /home/domainmail/debug/ham.eml

I still get no output, so I can't see what it's matching that it
shouldn't.
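
One possible explanation, which is an assumption and not something the thread confirms: SpamAssassin body rules run on the rendered body (MIME parts decoded, HTML tags and line breaks removed), while pcregrep reads the raw file one line at a time. A match that spans a line break in the raw file would then hit the rule but never show up in pcregrep. The Python sketch below imitates only the line-break part by collapsing each paragraph onto one line. It does not decode MIME or strip HTML, and the helper names are invented for the example.

```python
import re

PATTERN = re.compile(
    r"\b(?:v[il]a.{0,4}g.{0,4}r.{0,4}a)|(?:v.{0,4}[il].{0,4}a.{0,4}gra)",
    re.IGNORECASE,
)


def rendered_paragraphs(raw_text):
    """Rough stand-in for body rendering: split on blank lines, then
    collapse each paragraph's line breaks and runs of whitespace."""
    for paragraph in re.split(r"\n\s*\n", raw_text):
        collapsed = " ".join(paragraph.split())
        if collapsed:
            yield collapsed


def find_hits(raw_text, pattern=PATTERN):
    """Return every matched substring found in the collapsed paragraphs."""
    return [m.group(0)
            for paragraph in rendered_paragraphs(raw_text)
            for m in pattern.finditer(paragraph)]


raw = "vi ain't a\ngrand editor.\n\nSecond paragraph.\n"

# What a line-oriented tool sees: each line is searched on its own.
per_line = [m.group(0) for line in raw.splitlines()
            for m in PATTERN.finditer(line)]
print(per_line)        # -> []

# What turns up once the line break inside the paragraph is gone.
print(find_hits(raw))  # -> ["vi ain't a gra"]
```

If the message is base64 or quoted-printable encoded, or HTML-only, the raw file will differ from what the rule sees in further ways that this sketch ignores.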



 

Re: Rule debugging

Posted by Theo Van Dinter <fe...@apache.org>.
On Fri, Apr 06, 2007 at 03:19:09PM -0700, J. wrote:
> body    MY_VIAG    /\b(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)/i

just because it makes me cringe:
  - use [il] for character classes, not (i|l)
  - don't use (...) if you don't need capturing, use (?:...)

:)
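
Both suggestions tidy the pattern without changing what it matches. A small Python check of that, assuming Python's `re` behaves like Perl for these constructs; the sample strings are made up.

```python
import re

# Pattern as originally posted: capturing groups and (i|l).
ORIGINAL = re.compile(
    r"\b(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)",
    re.IGNORECASE,
)

# Same pattern with a character class and non-capturing groups.
TIDIED = re.compile(
    r"\b(?:v[il]a.{0,4}g.{0,4}r.{0,4}a)|(?:v.{0,4}[il].{0,4}a.{0,4}gra)",
    re.IGNORECASE,
)

SAMPLES = ["vi ain't a grand editor.", "v-l-a gra", "nothing to see here"]


def spans(pattern, texts):
    """Return the (start, end) of the first match in each text, or None."""
    found = []
    for text in texts:
        m = pattern.search(text)
        found.append(m.span() if m else None)
    return found


# Same matches at the same positions for every sample.
print(spans(ORIGINAL, SAMPLES) == spans(TIDIED, SAMPLES))  # -> True

# Only the bookkeeping differs: four capture groups versus none.
print(ORIGINAL.groups, TIDIED.groups)                      # -> 4 0
```

`(?:...)` groups without remembering what was matched, which is all this rule needs. In both versions the leading `\b` belongs only to the first alternative, because `|` binds more loosely than everything else in the pattern.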

> grep -i '(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)' /home/domainmail/debug/ham.eml
> 
> I get no output. Is there a better way to find what matched? Thanks.

grep doesn't do Perl regular expressions (though a GNU grep compiled
with PCRE support typically offers the -P option). You may want to use
pcregrep.
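
Plain grep reads its pattern as a basic regular expression, where an unescaped `(`, `|` or `{` is an ordinary literal character, so the command above searches for text containing those symbols. If pcregrep is not at hand, a few lines of Python can serve as a rough stand-in. This is a sketch only: the function names are invented, and it assumes Python's `re` handles this pattern the way Perl does.

```python
import re

PATTERN = re.compile(
    r"\b(v(i|l)a.{0,4}g.{0,4}r.{0,4}a)|(v.{0,4}(i|l).{0,4}a.{0,4}gra)",
    re.IGNORECASE,
)


def matching_lines(lines, pattern=PATTERN):
    """Yield (line_number, matched_text) for each line the pattern hits."""
    for number, line in enumerate(lines, start=1):
        m = pattern.search(line)
        if m:
            yield number, m.group(0)


def scan_file(path, pattern=PATTERN):
    """Run matching_lines over a file, e.g. a saved message."""
    with open(path, errors="replace") as handle:
        return list(matching_lines(handle, pattern))


# Demonstration on in-memory lines instead of a real message file.
demo = ["Subject: hello", "vi ain't a grand editor."]
for number, text in matching_lines(demo):
    print(number, repr(text))
```

Printing the matched text rather than the whole line shows which characters each `.{0,4}` absorbed.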

-- 
Randomly Selected Tagline:
"You're basically killing each other to see who's got the better imaginary
 friend." - Richard Jeni (on going to war over religion)

Re: Rule debugging

Posted by Matt Kettler <mk...@verizon.net>.
John D. Hardin wrote:
> On Fri, 6 Apr 2007, J. wrote:
>
>> I get no output. Is there a better way to find what matched? Thanks.
>
> Use egrep instead of grep?
>
> You might want to look at this instead of trying to hand-roll 
> obfuscation rules:
>
> http://www.impsec.org/~jhardin/antispam/obfusc.pl

Or even better, make use of the ReplaceTags plugin. See the examples in
25_replace.cf.



Re: Rule debugging

Posted by "J." <sw...@yahoo.com>.
--- "John D. Hardin" <jh...@impsec.org> wrote:

> On Sat, 7 Apr 2007, J. wrote:
> 
> > --- "John D. Hardin" <jh...@impsec.org> wrote:
> > 
> > > You might want to look at this instead of trying to hand-roll 
> > > obfuscation rules:
> > > 
> > > http://www.impsec.org/~jhardin/antispam/obfusc.pl
> > 
> > Thanks John. I have no idea what the program does but it does seem
> > to catch a lot of the stuff I was going after.
> 
> Basically, given a word list and scores it generates re's to catch
> most simple obfuscations of those words. Theo is right, it largely
> overlaps the ReplaceTags plugin stuff, but I think there are a few
> obfuscations that it catches that ReplaceTags does not (after an
> admittedly brief look at ReplaceTags)...
> 
> > The re is huge so I can't easily figure out what it's doing, but
> > it does miss some of the spam I was targeting with my rule though.
> > for example this one:
> > 
> > http://binaryops.com/spam3.txt
> 
> Yeah, at some point the obfuscation becomes problematic to detect with 
> a low rate of false positives, and it is to some degree a game of 
> whack-a-mole.
> 
> However, if the obfuscation becomes complex enough to be difficult to
> automatically detect, it becomes that much more difficult for the 
> victim to be able to *read* and make sense of, so the more esoteric 
> obfuscations become self-limiting.
> 
> > It was mail like that which forced me to use the .{0,4} clauses in my
> > rule. I'm probably causing some false positives though especially since
> > my scoring is really high.
> 
> Using .{0,4} is far too loose and will cause massive FPs. It's a 
> little better to try to match the specific extreme obfuscation 
> technique, in this case (?:\s[a-z]{2}\s)? (from your sample). Of 
> course, this will probably rot quickly.
> 
> Did you also create a rule for the "from $3, 33" parts? 
> --
>  John Hardin KA7OHZ

Actually the re in the rule was the only thing I could figure out that
actually matched all the spam that was getting through that day. I'm
not sure how common those kinds of mails are now, but I lowered the
scoring a lot in my rule so hopefully it won't cause (m)any fps. I
didn't bother with the $3, 33 part but you're right that it might be a
good way to avoid trouble if I make that part of the re. Here's the
work file I used while making the re:

http://binaryops.com/spamwork.txt



 

Re: Rule debugging

Posted by "John D. Hardin" <jh...@impsec.org>.
On Sat, 7 Apr 2007, J. wrote:

> --- "John D. Hardin" <jh...@impsec.org> wrote:
> 
> > You might want to look at this instead of trying to hand-roll 
> > obfuscation rules:
> > 
> > http://www.impsec.org/~jhardin/antispam/obfusc.pl
> 
> Thanks John. I have no idea what the program does but it does seem
> to catch a lot of the stuff I was going after.

Basically, given a word list and scores it generates re's to catch
most simple obfuscations of those words. Theo is right, it largely
overlaps the ReplaceTags plugin stuff, but I think there are a few
obfuscations that it catches that ReplaceTags does not (after an
admittedly brief look at ReplaceTags)...

> The re is huge so I can't easily figure out what it's doing, but
> it does miss some of the spam I was targeting with my rule though.
> for example this one:
> 
> http://binaryops.com/spam3.txt

Yeah, at some point the obfuscation becomes problematic to detect with 
a low rate of false positives, and it is to some degree a game of 
whack-a-mole.

However, if the obfuscation becomes complex enough to be difficult to
automatically detect, it becomes that much more difficult for the 
victim to be able to *read* and make sense of, so the more esoteric 
obfuscations become self-limiting.

> It was mail like that which forced me to use the .{0,4} clauses in my
> rule. I'm probably causing some false positives though especially since
> my scoring is really high.

Using .{0,4} is far too loose and will cause massive FPs. It's a 
little better to try to match the specific extreme obfuscation 
technique, in this case (?:\s[a-z]{2}\s)? (from your sample). Of 
course, this will probably rot quickly.
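
The difference between the two kinds of filler can be tried out directly. The Python sketch below is an illustration under assumptions: the thread shows only the fragment `(?:\s[a-z]{2}\s)?` and not a full rule, so the way the patterns are assembled here (the same filler between every letter) is a guess, and the obfuscated sample is made up and not taken from spam3.txt.

```python
import re

# Letters of the target word, with [il] allowing the usual i/l swap.
LETTERS = ["v", "[il]", "a", "g", "r", "a"]

LOOSE_FILL = r".{0,4}"             # any 0-4 characters between letters
TIGHT_FILL = r"(?:\s[a-z]{2}\s)?"  # optional: space, two letters, space


def build(fill):
    """Join the letters with the given filler and compile the result."""
    return re.compile(fill.join(LETTERS), re.IGNORECASE)


LOOSE = build(LOOSE_FILL)
TIGHT = build(TIGHT_FILL)

ham = "vi ain't a grand editor."
obfuscated = "vi qq ag zz ra"      # hypothetical obfuscation

for text in (ham, obfuscated):
    print(repr(text), bool(LOOSE.search(text)), bool(TIGHT.search(text)))
```

The loose filler matches both strings, the harmless sentence included. The tight filler matches only the string that uses the specific trick it was written for, which is also why it stops working once the trick changes.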

Did you also create a rule for the "from $3, 33" parts?

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Gun Control is marketed to the public using the appealing delusion
  that violent criminals will obey the law.
-----------------------------------------------------------------------
 5 days until Thomas Jefferson's 264th Birthday



Re: Rule debugging

Posted by "J." <sw...@yahoo.com>.
--- "John D. Hardin" <jh...@impsec.org> wrote:

> On Fri, 6 Apr 2007, J. wrote:
> 
> > I get no output. Is there a better way to find what matched? Thanks.
> 
> Use egrep instead of grep?
> 
> You might want to look at this instead of trying to hand-roll 
> obfuscation rules:
> 
> http://www.impsec.org/~jhardin/antispam/obfusc.pl
> 
> --
>  John Hardin KA7OHZ                   

Thanks John. I have no idea what the program does but it does seem to
catch a lot of the stuff I was going after. The re is huge so I can't
easily figure out what it's doing, and it does miss some of the spam I
was targeting with my rule. For example, this one:

http://binaryops.com/spam3.txt

It was mail like that which forced me to use the .{0,4} clauses in my
rule. I'm probably causing some false positives though especially since
my scoring is really high.

-Jason


 

Re: Rule debugging

Posted by "John D. Hardin" <jh...@impsec.org>.
On Fri, 6 Apr 2007, J. wrote:

> I get no output. Is there a better way to find what matched? Thanks.

Use egrep instead of grep?

You might want to look at this instead of trying to hand-roll 
obfuscation rules:

http://www.impsec.org/~jhardin/antispam/obfusc.pl

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  ...much of our country's counterterrorism security spending is not
  designed to protect us from the terrorists, but instead to protect
  our public officials from criticism when another attack occurs.
                                                    -- Bruce Schneier
-----------------------------------------------------------------------
 7 days until Thomas Jefferson's 264th Birthday