Posted to users@spamassassin.apache.org by Kevin Miller <Ke...@ci.juneau.ak.us> on 2011/04/21 23:54:45 UTC

Regex help

We've been receiving a lot of spam lately that is waltzing right through the spam filters.  I trained on a thousand or more yesterday (we only get around 3-5 thousand legitimate messages a day), but of course it is always changing slightly, and comes from sources that most often aren't yet in any RBLs.

The spam is HTML mail and one thing I've noticed is that there is a large number of html break codes in the body, pushing the 'unsubscribe' down below the bottom of the screen.  (FWIW, clicking on an unsub link generally fires a warning from TrendMicro about the site being compromised.)

Anyway, I'm trying to write a local rule that will scan for 5 or more instances of "<br>" but not having much luck.  I'm testing first on the CLI, just trying to get the syntax down.  

What works:
I have a file called DomainLiterals.txt with repeating characters and it returns expected results:
mkm@mis-mkm-lnx:~$ egrep \[10.]{3} DomainLiterals.txt 
you can add a line containing only [10.10.10.10] to /etc/mail/local-host-names where 10.10.10.10 is the IP address you 

However, doing this fails:
mxg:/var/spool/MailScanner/quarantine/20110421/nonspam # egrep \[<br>]{5,} p3LJZSnX024470
-bash: br: No such file or directory

The file p3LJZSnX024470 is just a plain text file in a quarantine directory.

What am I missing?  I'll turn this into a body rule once I get the syntax right then test it for a day or so w/a score of .01.  If I'm not hitting legitimate mail I'll bump it up.

Thanks...

...Kevin
--
Kevin Miller                Registered Linux User No: 307357
CBJ MIS Dept.               Network Systems Admin., Mail Admin.
155 South Seward Street     ph: (907) 586-0242
Juneau, Alaska 99801        fax: (907) 586-4500

Re: Regex help

Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2011-04-21 at 13:54 -0800, Kevin Miller wrote:
> mxg:/var/spool/MailScanner/quarantine/20110421/nonspam # egrep \[<br>]{5,} p3LJZSnX024470
>
That won't do what you want anyway, since it's asking for "a sequence of
5 or more characters, each of which must be one of <, >, b or r" and isn't
allowing for possible whitespace between '<br>' occurrences. Something
like:

grep -P '<br>\s{0,5}<br>\s{0,5}<br>' x1.txt

will match on three or more <br> tags with possible whitespace between
them. The -P option says 'use Perl regex syntax' - a recent grep
extension and one you should use when developing SA regexes with grep or
egrep.
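
For what it's worth, here is a minimal Python sketch of that pattern. It
assumes Python's re module treats this simple pattern the same way Perl
syntax does, and the sample strings are made up. It also shows one caveat:
grep looks at one line at a time by default, so tags sitting on separate
lines are only found when the whole text is searched at once.

```python
import re

# The pattern above: three <br> tags, each pair separated by 0-5 whitespace chars.
THREE_BR = re.compile(r'<br>\s{0,5}<br>\s{0,5}<br>')

def matches_whole(text):
    """Search the whole text at once, so whitespace can span line breaks."""
    return THREE_BR.search(text) is not None

def matches_per_line(text):
    """Search one line at a time, the way grep does by default."""
    return any(THREE_BR.search(line) for line in text.splitlines())

# Hypothetical samples, not taken from the real spam.
same_line = 'Hello<br> <br>  <br>unsubscribe'
split_lines = 'Hello<br>\n<br>\n<br>unsubscribe'

print(matches_whole(same_line), matches_per_line(same_line))
print(matches_whole(split_lines), matches_per_line(split_lines))
```

So a regex that looks fine in a per-line grep test can still behave
differently once it runs against a whole message body.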


Martin



RE: Regex help

Posted by Benny Pedersen <me...@junc.org>.
On Thu, 21 Apr 2011 15:37:02 -0800, Kevin Miller
 
>>> body      CBJ_GiveMeABreak  /\["<br>"]{5,}/
>>> describe  CBJ_GiveMeABreak  Messages with multiple consecutive break characters
>>> score     CBJ_GiveMeABreak  0.01
 
> I'm wading through it, trying to understand it all.  Printed some regex
> tutorial web pages as well.

Body rules also see only the rendered HTML, so this needs to be a rawbody
rule, which does not strip the HTML tags.


RE: Regex help

Posted by Kevin Miller <Ke...@ci.juneau.ak.us>.
Adam Katz wrote:
> Getting back to a viable solution to your actual spam problem...
> 
>> Adam Katz wrote:
>>> How about this rule instead:
>>> 
>>> blacklist_from  *@regionstargpsupdates.com
> 
> On 04/21/2011 04:37 PM, Kevin Miller wrote:
>> Yes, but then I'm playing whack-a-mole.  Looking at the spam in html
>> format (i.e., in the original email) one can see similarities in
>> style - probably produced from a template.  But the domain varies
>> widely.  I may get anywhere from a half dozen to several dozen from
>> any one domain, then never see that domain again.  Classic botnet
>> behaviour.  These guys cycle through domains and from addresses
>> regularly.
> 
> Okay, I couldn't tell that from your single sample.  Perhaps you can
> post a few more? 

Yeah - it's always a challenge to post just enough info w/o overdoing it and making it harder to separate the wheat from the chaff.


> If it's easier to post in one pass, you can use the following shell
> code (as adjusted to include the proper files rather than my guesses)
> to generate a fake mbox file (/tmp/dump) and then paste that into a
> pastebin:   
> 
> for msg in p3LJZSnX024470 p3LJZSnX024471 p3LJZSnX024472 p3LJZSnX024473;
> do echo "From $msg@KM" >>/tmp/dump; cat "$msg" >>/tmp/dump; done
> 
> Fun note:  pastebin.com now supports email syntax highlighting!

OK - posted at http://pastebin.com/R67ut7De

There are four posts.  The only munging was to change the name of the recipient from the real user to "hapless_user".

Hope there's something there that helps others as well as myself.

FWIW, I added your regex this morning with a score of 1.5 and it's working a treat.  No false positives.  I had it at 1 but some messages still slipped through.  They weren't yet in any RBLs and were just under 5.  After they were in some RBLs they generally hit in the 7-8 range.  I'm probably being a little aggressive, but am watching it closely in case I need to back off...


...Kevin

Re: Regex help

Posted by Adam Katz <an...@khopis.com>.
Getting back to a viable solution to your actual spam problem...

> Adam Katz wrote:
>> How about this rule instead:
>> 
>> blacklist_from  *@regionstargpsupdates.com

On 04/21/2011 04:37 PM, Kevin Miller wrote:
> Yes, but then I'm playing whack-a-mole.  Looking at the spam in html
> format (i.e., in the original email) one can see similarities in
> style - probably produced from a template.  But the domain varies
> widely.  I may get anywhere from a half dozen to several dozen from
> any one domain, then never see that domain again.  Classic botnet
> behaviour.  These guys cycle through domains and from addresses
> regularly.

Okay, I couldn't tell that from your single sample.  Perhaps you can
post a few more?

If it's easier to post in one pass, you can use the following shell code
(as adjusted to include the proper files rather than my guesses) to
generate a fake mbox file (/tmp/dump) and then paste that into a pastebin:

for msg in p3LJZSnX024470 p3LJZSnX024471 p3LJZSnX024472 p3LJZSnX024473;
do echo "From $msg@KM" >>/tmp/dump; cat "$msg" >>/tmp/dump; done

Fun note:  pastebin.com now supports email syntax highlighting!


RE: Regex help

Posted by Kevin Miller <Ke...@ci.juneau.ak.us>.
Adam Katz wrote:
> On 04/21/2011 03:55 PM, Kevin Miller wrote:
>> Thanks (also to Martin who replied).  I posted one of the spams
>> here: http://pastebin.com/9aBAxR7m 
>> 
>> You can see the long series of break codes in it.
> 
> Yes I can.  I can also see several other diagnostic bits in it, such
> as the domain: 
> http://www.siteadvisor.com/sites/regionstargpsupdates.com  
> 
> How about this rule instead:
> 
> blacklist_from  *@regionstargpsupdates.com
> 
> It's much faster and, given the report of the domain being that of a
> spammer, much much safer. 

Yes, but then I'm playing whack-a-mole.  Looking at the spam in html format (i.e., in the original email) one can see similarities in style - probably produced from a template.  But the domain varies widely.  I may get anywhere from a half dozen to several dozen from any one domain, then never see that domain again.  Classic botnet behaviour.  These guys cycle through domains and from addresses regularly.

One thing that is consistent across all the spams is an exclamation mark at the end of the subject line.  Sadly, plenty of ham also displays that.

>> Sorry for the confusion on the 10.10.10.10 - that isn't part of the
>> spam, it was just a handy file for testing since it had a repeating
>> string in it.
> 
> It was a faulty test since '[10.]{3}' will match '10.10.10.10' but
> not in the way that you think; it matches the first three characters
> and will therefore also match the string '110.64.323.6'  

Right - caught that from your previous post.  


>> I did get it to work from the CLI, and wrote the following rule:
>> 
>> body      CBJ_GiveMeABreak  /\["<br>"]{5,}/
>> describe  CBJ_GiveMeABreak  Messages with multiple consecutive break characters
>> score     CBJ_GiveMeABreak  0.01
> 
> That will not match your sample.  Please re-read my message.  The
> regex is wrong and the rule type (body) is wrong. 

I'm wading through it, trying to understand it all.  Printed some regex tutorial web pages as well.
I added the rule before any replies showed up but am removing it since it's a valiant effort but not hitting where I'd hoped...

...Kevin

Re: Regex help

Posted by Adam Katz <an...@khopis.com>.
On 04/21/2011 03:55 PM, Kevin Miller wrote:
> Thanks (also to Martin who replied).  I posted one of the spams here:
> http://pastebin.com/9aBAxR7m
> 
> You can see the long series of break codes in it.

Yes I can.  I can also see several other diagnostic bits in it, such as
the domain:  http://www.siteadvisor.com/sites/regionstargpsupdates.com

How about this rule instead:

blacklist_from  *@regionstargpsupdates.com

It's much faster and, given the report of the domain being that of a
spammer, much much safer.

> Sorry for the confusion on the 10.10.10.10 - that isn't part of the
> spam, it was just a handy file for testing since it had a repeating
> string in it.

It was a faulty test since '[10.]{3}' will match '10.10.10.10' but not
in the way that you think; it matches the first three characters and
will therefore also match the string '110.64.323.6'

> I did get it to work from the CLI, and wrote the following rule:
> 
> body      CBJ_GiveMeABreak  /\["<br>"]{5,}/
> describe  CBJ_GiveMeABreak  Messages with multiple consecutive break characters
> score     CBJ_GiveMeABreak  0.01

That will not match your sample.  Please re-read my message.  The regex
is wrong and the rule type (body) is wrong.

> I know it may trigger on some ham which is why I set the initial
> score to 0.01.  Better ideas are most welcome though!



RE: Regex help

Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2011-04-21 at 14:55 -0800, Kevin Miller wrote:
> I know it may trigger on some ham which is why I set the initial score
> to 0.01.  Better ideas are most welcome though!
> 
It may be a good idea to look at the headers, especially From, From: and
Message-ID: and at body URIs to see if there are any recognisable
patterns. If so, it may be easier to write rule(s) to match them.


Martin



RE: Regex help

Posted by Kevin Miller <Ke...@ci.juneau.ak.us>.
darxus@chaosreigns.com wrote:
> On 04/21, Adam Katz wrote:
>> rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi
> 
> I wonder if it would be useful to generalize this as:
> 
> rawbody LOCAL_8X_TAGS   /(?:<[^>]*>[\s\r\n]{0,4}){8}/mi
> 
> Just a mess of tags in a row without any content.

I'll leave that discussion to those more adept than myself.
 
> On 04/21, Kevin Miller wrote:
>> body      CBJ_GiveMeABreak  /\["<br>"]{5,}/
> 
> Please try to listen to the nice people who are telling you that
> won't do what you think. 

I'm happy to listen to any advice I receive!

I genned up my rule after I posted but before anybody replied.  I'm still trying to grok it all.  And I appreciate all the help I get...

...Kevin

Re: Regex help

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2011-04-21 at 16:08 -0800, Kevin Miller wrote:
> Karsten Bräckelmann wrote:
> > That should do the trick indeed.
> > 
> > After this, I strongly suggest to carefully re-read the entire
> > thread, and read some docs specifically about the points raised. That
> > includes RE peculiarities [1] you used with previous REs without
> > knowing them, as well as my escaping notes with using the shell.   

> Again, thanks very much to all that chimed in.  Lots to digest here,
> and I'm sure I'll still miss some of the finer points, but having a
> real problem to solve is the best way to actually learn this stuff.

True. But don't stop at understanding why the resulting rule works.
Instead, try to understand why and where each and every previous attempt
(avoiding the term RE here) failed.

Of course, I am particularly back at the different levels of escaping.
Think shell. It adds an additional level of interpretation and thus
escaping. Basics, that really can bite your ass. Classic example:

  find . -name '*.pdf'

*Without* the quotes, *.pdf will be expanded by the shell, IFF there are
PDF files in the dir. If there are none, it just works as expected.

If there are, however, the shell will expand the wildcard. Either
leading to an error (here, with more than one PDF file), or silently
ignoring anything that is not named exactly as the one PDF file in the
current dir...
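
To make the two cases concrete, here is a small Python sketch that imitates
them. It is only a model: fnmatch stands in both for the shell's wildcard
expansion and for find's -name matching, and the file names and helper
functions (shell_expand, find_name) are invented for illustration.

```python
import fnmatch

def shell_expand(pattern, cwd_files):
    """Model an unquoted wildcard: the shell replaces it with the matching
    names in the current dir, or leaves it untouched if nothing matches
    (bash's default behaviour)."""
    hits = sorted(f for f in cwd_files if fnmatch.fnmatchcase(f, pattern))
    return hits if hits else [pattern]

def find_name(name_pattern, all_files):
    """Model find -name: keep files whose base name matches the pattern."""
    return [f for f in all_files if fnmatch.fnmatchcase(f, name_pattern)]

cwd_files = ['a.pdf', 'notes.txt']          # hypothetical current dir
all_files = cwd_files + ['b.pdf', 'c.pdf']  # plus files further down the tree

# Quoted: find itself receives the pattern *.pdf.
quoted = find_name('*.pdf', all_files)

# Unquoted: the shell expands first, so find receives a.pdf instead.
args = shell_expand('*.pdf', cwd_files)
# With more than one expanded name, find would complain instead (None here).
unquoted = find_name(args[0], all_files) if len(args) == 1 else None

print(quoted)
print(args, unquoted)
```

In this model the quoted form finds all three PDF files, while the unquoted
form silently narrows the search to the one file in the current dir.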

Multiple levels of escaping. As shown in your OP.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


RE: Regex help

Posted by Kevin Miller <Ke...@ci.juneau.ak.us>.
Karsten Bräckelmann wrote:
> On Thu, 2011-04-21 at 15:47 -0800, Kevin Miller wrote:
>> Karsten Bräckelmann wrote:
>>> What you want. The string '<br>', repeated five times (or more). For
>>> the quantifier, you need to group the string.
>>> 
>>>   /(?:<br>){5}/
> 
>> Great.  I've changed my rule to that, and am going to look at Adam's
>> somewhat enhanced version to understand what all it's doing.  To wit:
>>  rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi
> 
> That should do the trick indeed.
> 
> After this, I strongly suggest to carefully re-read the entire
> thread, and read some docs specifically about the points raised. That
> includes RE peculiarities [1] you used with previous REs without
> knowing them, as well as my escaping notes with using the shell.   
> 
> 
>> I note that Adam used rawbody rather than body, so I presume that I
>> should change my rule to that as well.
> 
> Yup, he explained why you need that -- otherwise, HTML tags are not
> preserved verbatim, but HTML parts rendered and normalized. 
> 
> 
> [1] PCRE flavor, Perl Compatible REs.

Again, thanks very much to all that chimed in.  Lots to digest here, and I'm sure I'll still miss some of the finer points, but having a real problem to solve is the best way to actually learn this stuff.

Have a great day gentlemen... 

...Kevin

RE: Regex help

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2011-04-21 at 15:47 -0800, Kevin Miller wrote:
> Karsten Bräckelmann wrote:
> > What you want. The string '<br>', repeated five times (or more). For 
> > the quantifier, you need to group the string.
> > 
> >   /(?:<br>){5}/

> Great.  I've changed my rule to that, and am going to look at Adam's
> somewhat enhanced version to understand what all it's doing.  To wit:
>  rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi

That should do the trick indeed.

After this, I strongly suggest to carefully re-read the entire thread,
and read some docs specifically about the points raised. That includes
RE peculiarities [1] you used with previous REs without knowing them, as
well as my escaping notes with using the shell.


> I note that Adam used rawbody rather than body, so I presume that I
> should change my rule to that as well.

Yup, he explained why you need that -- otherwise, HTML tags are not
preserved verbatim, but HTML parts rendered and normalized.


[1] PCRE flavor, Perl Compatible REs.



Re: Regex help

Posted by Bowie Bailey <Bo...@BUC.com>.
On 4/21/2011 7:47 PM, Kevin Miller wrote:
>
> Great.  I've changed my rule to that, and am going to look at Adam's somewhat enhanced version to understand what all it's doing.  To wit:
>  rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi

It matches:

<br> or <br/>  followed by 0 to 4 whitespace or return characters
...all of that repeated 5 times
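
A quick way to convince yourself of that reading is to try the pattern on a
few strings. This is a minimal sketch in Python rather than Perl; it assumes
Python's re module handles this pattern the same way, with the /mi flags
mapped to re.M and re.I, and the sample strings are made up.

```python
import re

# The rule's pattern, with Perl's /mi flags mapped to re.M | re.I.
FIVE_BR = re.compile(r'(?:<br\/?>[\s\r\n]{0,4}){5}', re.M | re.I)

def hits(raw_html):
    """True if five <br> or <br/> tags occur in a row, allowing up to
    four whitespace characters after each tag."""
    return FIVE_BR.search(raw_html) is not None

five_mixed = '<br>\n<BR/>\r\n<br> <br>\t<br>'   # 5 tags, assorted whitespace
four_only = '<br><br><br><br>'                   # one tag short
text_between = '<br>a<br>b<br>c<br>d<br>'        # 5 tags, but not consecutive

for sample in (five_mixed, four_only, text_between):
    print(repr(sample), hits(sample))
```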

-- 
Bowie

RE: Regex help

Posted by Kevin Miller <Ke...@ci.juneau.ak.us>.
Stupid Outlook.  Meant to reply to the list again.  Sigh.

Karsten Bräckelmann wrote:
> 
> What you want. The string '<br>', repeated five times (or more). For 
> the quantifier, you need to group the string.
> 
>   /(?:<br>){5}/
> 
> Besides the above, do not use {5,} as a quantifier, UNLESS there is 
> something after that string you also want to match. If you do not want 
> to match anything after that, "exactly 5 times" {5} will match always 
> the same as "five or more" {5,} -- the latter just
> unnecessarily keeps on trying.    

Great.  I've changed my rule to that, and am going to look at Adam's somewhat enhanced version to understand what all it's doing.  To wit:
 rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi

I note that Adam used rawbody rather than body, so I presume that I should change my rule to that as well.

Thanks... 

...Kevin

RE: Regex help

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2011-04-21 at 14:55 -0800, Kevin Miller wrote:
> I did get it to work from the CLI, and wrote the following rule:
> 
> body      CBJ_GiveMeABreak  /\["<br>"]{5,}/

This still is wrong. Something that has been mentioned, but not properly
explained to you is the char class, denoted by square brackets. The
RE /[bar]/ will match any char in the class, that is either a "b", an
"a" or an "r".

In this case (the rule above) it is NOT a char class though, because you
backslash escaped the opening square bracket, turning it into the char
itself. The reason the RE (the part inside the slash / delimiters) DID
work with grep on the command line is because the backslash escaped the
opening square bracket for your shell, preventing your *shell* from
interpreting it -- but the RE passed to your grep features the square
bracket, turning it again into a char class. Multiple levels of
escaping. If you wanna test an RE with grep, seriously better 'single
quote' the entire RE, rather than escaping single chars. This will
prevent such issues.
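
As a rough illustration of that extra level, Python's shlex module can show
what argument list a POSIX shell would build from a command line. This is
only a sketch: shlex models quoting and backslash removal, not redirection
(< and >) or brace expansion, so the simpler DomainLiterals.txt command from
the original post is used here.

```python
import shlex

# The backslash-escaped command from the original post, and the
# single-quoted form recommended above.
escaped = r"egrep \[10.]{3} DomainLiterals.txt"
quoted = "egrep '[10.]{3}' DomainLiterals.txt"

# shlex.split applies POSIX shell quoting rules to produce the argument list.
escaped_args = shlex.split(escaped)
quoted_args = shlex.split(quoted)

print(escaped_args)
print(quoted_args)
```

Both forms hand grep the same RE, with the backslash already gone, which is
why the bracket ends up opening a char class after all.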

grep on your shell was looking for any char of the class [<>br], 5
times. That matches the string '<br><'.

For perl, with one less interpretation of the string (no shell), it
looks for the string '["<br>"]]]]]'

Yes, the double-quotes prevented your shell from interpreting < as
STDIN, like it was breaking your command in the OP. Without the shell,
it just is a char, though. Also, the {5,} operates on the thingy in
front of it -- which is a single char here, because you did not (?:)
group the leading sub-RE.
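
The two interpretations can be compared side by side. A minimal sketch in
Python, assuming its re module parses these two REs the same way grep and
Perl do (the names grep_view and rule_view are just labels for this
example):

```python
import re

# What grep received after the shell removed the backslash and the quotes:
# a char class, repeated 5 or more times.
grep_view = re.compile(r'[<br>]{5,}')

# What Perl sees in the rule: a literal [ " <br> " and then ] repeated 5+ times.
rule_view = re.compile(r'\["<br>"]{5,}')

grep_hit = grep_view.search('<br><')                    # any 5 chars out of < b r >
rule_hit = rule_view.search('["<br>"]]]]]')             # the odd literal string
rule_on_spam = rule_view.search('<br><br><br><br><br>') # what the spam contains

print(grep_hit, rule_hit, rule_on_spam)
```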


What you want. The string '<br>', repeated five times (or more). For the
quantifier, you need to group the string.

  /(?:<br>){5}/

Besides the above, do not use {5,} as a quantifier, UNLESS there is
something after that string you also want to match. If you do not want
to match anything after that, "exactly 5 times" {5} will match always
the same as "five or more" {5,} -- the latter just unnecessarily keeps
on trying.
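
A small check of that claim, sketched in Python (assuming its re module
behaves like PCRE for these two quantifiers): both REs hit exactly the same
inputs, and only the length of the matched text differs.

```python
import re

exactly_five = re.compile(r'(?:<br>){5}')
five_or_more = re.compile(r'(?:<br>){5,}')

samples = ['<br>' * n for n in range(0, 9)]   # 0 to 8 consecutive tags

# Yes/no outcome per sample; this is all a rule hit depends on.
hits_exact = [exactly_five.search(s) is not None for s in samples]
hits_more = [five_or_more.search(s) is not None for s in samples]

# The only difference is how much text the match consumes.
len_exact = len(exactly_five.search('<br>' * 8).group())
len_more = len(five_or_more.search('<br>' * 8).group())

print(hits_exact == hits_more, len_exact, len_more)
```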




RE: Regex help

Posted by Kevin Miller <Ke...@ci.juneau.ak.us>.
Oops - this should have gone to the list.  Sorry.

Adam Katz wrote:
> Before I help you with your shell and regex issues, I should point out 
> that this is not a very strong rule.  It will hit ham.
SNIP
> 
> Better solution:  put some examples up on a pastebin and link them to 
> us so we can help you find more diagnostic (and simpler) patterns to 
> nail them with.

Thanks (also to Martin who replied).  I posted one of the spams here:
http://pastebin.com/9aBAxR7m

You can see the long series of break codes in it.

Sorry for the confusion on the 10.10.10.10 - that isn't part of the spam, it was just a handy file for testing since it had a repeating string in it.

I did get it to work from the CLI, and wrote the following rule:

body      CBJ_GiveMeABreak  /\["<br>"]{5,}/
describe  CBJ_GiveMeABreak  Messages with multiple consecutive break characters
score     CBJ_GiveMeABreak  0.01

I know it may trigger on some ham which is why I set the initial score to 0.01.  Better ideas are most welcome though!

Thanks much...

...Kevin

Re: Regex help

Posted by Adam Katz <an...@khopis.com>.
> "egrep '[<br>]{5,}' p3L..." prevents the shell from trying to interpret
> your query but still has a bad query, as it looks for five or more
> consecutive occurrences of any character listed between the angle
> brackets, so "<b>brr</b>" will match up to the slash.

Between the square brackets ("[" and "]"), sorry.
Angle brackets ("<" and ">") have no special meaning in PCRE (though
they're word boundaries in vim's very-magic regexps) while square
brackets denote character classes, as noted in "man perlre".

(I always chuckle when I see them called that; makes me want to do
something like '[[:paladin:]]*?' ... or in vim, '\v[[:paladin:]]{-}'
which looks for a very magical member of the "paladin" class in a group
that is not greedy.  Too bad I can't also specify race.  Maybe I can
create a race condition?)


Re: Regex help

Posted by Adam Katz <an...@khopis.com>.
On 04/21/2011 05:22 PM, John Hardin wrote:
> On Thu, 21 Apr 2011, Adam Katz wrote:
> 
>> rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi
> 
> ...when does \s{0,4} not match the same text as [\s\r\n]{0,4} ?
> 
> (i.e. \r and \n are whitespace, no?)

I believe they are identical assuming /msi flags.  I seem to recall a
particular problem with the engine having trouble here, though that was
probably related to rendered bodies on systems that determine line
breaks differently.  It may instead be related to something specific
with my company's implementation, which is rather nonstandard.

Finally, [\s\r\n] is more legible for troubleshooting as it acts as a
reminder of what is going on.  In the event that there is an efficiency
issue, \s is first.
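
For anyone who wants to check the equivalence, here is a minimal brute-force
sketch. It uses Python's re module rather than Perl, so it says nothing
about the engine quirks mentioned above; it only assumes that \s covers \r
and \n in both, which is the documented behaviour.

```python
import re
from itertools import product

plain = re.compile(r'\s{0,4}')
verbose = re.compile(r'[\s\r\n]{0,4}')

# Every string of length 0-4 built from a few whitespace chars and one other char.
alphabet = [' ', '\t', '\r', '\n', 'x']
candidates = [''.join(p) for n in range(5) for p in product(alphabet, repeat=n)]

# Strings on which the two forms disagree about a full match.
disagreements = [c for c in candidates
                 if bool(plain.fullmatch(c)) != bool(verbose.fullmatch(c))]

print(len(candidates), len(disagreements))
```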


Re: Regex help

Posted by Adam Katz <an...@khopis.com>.
On 04/22/2011 07:02 AM, Joseph Brennan wrote:
> I'd be cautious with this.
> 
> I have tried scoring for multiple <br> and also for more than ten 
> closing </div> in a row, but unless you score very low, you'll get 
> false positives. Unfortunately some legitimate software products 
> translate their native format into HTML with ugly code like that.
> 
> It could be that a meta of multiple <br> plus something else gets a
> more accurate spam diagnosis, so I'm not saying it's useless, but it
> is not as straightforward as it seems.

+1

My mention of this may have been lost in the noise, especially given how
I've continued along this path intellectually.


Re: Regex help

Posted by Joseph Brennan <br...@columbia.edu>.
I'd be cautious with this.

I have tried scoring for multiple <br> and also for more than ten
closing </div> in a row, but unless you score very low, you'll get
false positives. Unfortunately some legitimate software products
translate their native format into HTML with ugly code like that.

It could be that a meta of multiple <br> plus something else gets
a more accurate spam diagnosis, so I'm not saying it's useless, but
it is not as straightforward as it seems.

Joseph Brennan
Columbia University Information Technology




Re: Regex help

Posted by John Hardin <jh...@impsec.org>.
On Thu, 21 Apr 2011, Adam Katz wrote:

> rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi

...when does \s{0,4} not match the same text as [\s\r\n]{0,4} ?

(i.e. \r and \n are whitespace, no?)

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Our government wants to do everything it can "for the children,"
   except sparing them crushing tax burdens.
-----------------------------------------------------------------------
  2 days until Max Planck's 153rd birthday

Re: Regex help

Posted by Adam Katz <an...@khopis.com>.
Before I help you with your shell and regex issues, I should point out
that this is not a very strong rule.  It will hit ham.

On 04/21/2011 02:54 PM, Kevin Miller wrote:
> I'm trying to write a local rule that will scan for 5 or more 
> instances of "<br>" but not having much luck.  I'm testing first on 
> the CLI, just trying to get the syntax down.

> What works:
> I have a file called DomainLiterals.txt with repeating characters
> and it returns expected results:
> mkm@mis-mkm-lnx:~$ egrep \[10.]{3} DomainLiterals.txt 
> you can add a line containing only [10.10.10.10] to
> /etc/mail/local-host-names where 10.10.10.10 is the IP address you

The regex '\[10.]{3}' is invalid.  It un-escapes from the command line as
'[10.]{3}' but will match any of these:

111
...
000
10.
.01

since it is asking for three of any character matching one, zero, or
dot.  The grouping symbol you are looking for is a parenthesis, and
the dot (when outside a square bracket) must be escaped as it otherwise
means "any single character."
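
The difference between the character class and a grouped, escaped version
is easy to check. A minimal sketch in Python, assuming its re module treats
these two patterns like the grep and Perl engines discussed here (the
grouped pattern is one possible fix, not something from the original post):

```python
import re

char_class = re.compile(r'[10.]{3}')     # three chars, each a 1, a 0 or a dot
grouped = re.compile(r'(?:10\.){3}')     # the string "10." three times in a row

for sample in ['111', '000', '10.', '.01', '110.64.323.6', '10.10.10.10']:
    print(sample, bool(char_class.search(sample)), bool(grouped.search(sample)))
```

The character class hits every sample, including the unrelated address,
while the grouped form only hits the repeated "10." string.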

> However, doing this fails:
> mxg:/var/spool/MailScanner/quarantine/20110421/nonspam # egrep \[<br>]{5,} p3LJZSnX024470
> -bash: br: No such file or directory
> 
> The file p3LJZSnX024470 is just a plain text file in a quarantine directory.

Again, you have a CLI escaping issue AND a regex issue.  If you are not
quoting that query, you need to escape almost every single punctuation
character listed there.  Alternatively, you could put that query in quotes.

"egrep \[<br>]{5,} p3L..." tells the shell that you are looking for the
query "[" from input file "br" and you want to output your results to
(invalid) file "]" and then run the command "5," in a subshell, followed
by a third command (your email file).

"egrep '[<br>]{5,}' p3L..." prevents the shell from trying to interpret
your query but still has a bad query, as it looks for five or more
consecutive occurrences of any character listed between the angle
brackets, so "<b>brr</b>" will match up to the slash.

> What am I missing? I'll turn this into a body rule once I get the
> syntax right then test it for a day or so w/a score of .01. If I'm not
> hitting legitimate mail I'll bump it up.

On top of all of this, egrep does not use Perl-compatible regular
expressions (PCRE) (though the regexps I've used so far are compatible
with Posix regexps as well as PCRE).  See 'man perlre' (or your favorite
website) for help on PCREs.  Try using either grep -P (requires
libpcre3) or pcregrep (which you may have to install) or else perl
itself, like:

  perl -ne 'print if /whatever/'  < DomainLiterals.txt

As to what that should be searching for, I suspect you want a multi-line
expression (which none of the above shell commands will help you with
since they parse one line at a time).  Try this:

header  LOCAL_10_10_10_10  X-Spam-Relays-Untrusted =~ /^[^\[]+ ip=(?:10\.){3}/

rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi

That second one will also match <br/> and allows for a few spaces, tabs,
or linebreaks in between the <br> tags.  For a more strict version of
what you're looking for, try this:

rawbody LOCAL_5X_BR_TAGS   /(?:<br>){5}/i

Note that you need rawbody since body rules will strip HTML.
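
To see why the rule type matters, here is a rough Python sketch. The
crude_render function is a made-up stand-in for HTML rendering (it is not
how SpamAssassin actually renders a message), and the sample HTML is
invented; the point is only that once tags are stripped, a regex looking for
<br> has nothing left to match.

```python
import re

FIVE_BR = re.compile(r'(?:<br>){5}', re.I)

def crude_render(html):
    """Very rough stand-in for HTML rendering: <br> becomes a newline and
    all other tags are dropped. SpamAssassin's real rendering differs."""
    text = re.sub(r'<br\s*/?>', '\n', html, flags=re.I)
    return re.sub(r'<[^>]+>', '', text)

# Hypothetical message part, not taken from the real spam.
raw = ('<html><body>Buy now!<br><br><br><br><br>'
       '<a href="#">unsubscribe</a></body></html>')
rendered = crude_render(raw)

raw_hit = FIVE_BR.search(raw) is not None            # what a rawbody rule sees
rendered_hit = FIVE_BR.search(rendered) is not None  # roughly what a body rule sees

print(raw_hit, rendered_hit)
```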


Again, this rule will hit some hams.  It is also not terribly CPU-efficient.

Better solution:  put some examples up on a pastebin and link them to us
so we can help you find more diagnostic (and simpler) patterns to nail
them with.