Posted to users@spamassassin.apache.org by Dan <a...@patnode.net> on 2006/05/15 23:07:37 UTC

Comment Crashes

I'm running into more comment counting problems:


This crashes SA:
	full FloatingTags1 /(>\s?[\$%A-Z0-9]\s?<.*?){90,}/is


This does not:
	full FloatingTags2 /(>\s?[\$%A-Z0-9]\s?<.*?){30,}/is


While this doesn't crash, it also doesn't work:
	full FloatingTags3 /(?>>\s?[\$%A-Z0-9]\s?<.*?){90,}/is


Based on Matt's recent comments:

> Yes, but across the entire message body using .* in a rule is  
> REALLY slow.
>
> I didn't say that counting was impossible with rules, I said it is  
> not good at it.

> Counting occurrences of something across the entire body of the  
> message
> is not something SA is good at with just rules. You'd need a plugin to
> do it.

My premise is wrong.  Do I just need to give up on regex for this and
find a way to eval it (I haven't learned Perl yet!)?

Thanks!
Dan
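
[Editor's sketch] One way to read Matt's advice: rather than asking a single regex to repeat a group 90 times across the whole message, match the small pattern once and count the hits in ordinary code. The example below is Python and is only an illustration of that idea; a real SpamAssassin plugin would be written in Perl. The names `count_floating_chars` and `looks_like_floating_tags`, and the default threshold of 90 (borrowed from FloatingTags1), are assumptions and not part of any existing API.

```python
import re

# One "floating" character between tags, e.g. "> V<".
# Same inner pattern as the FloatingTags rules, without the repeated group.
FLOATING = re.compile(r">\s?[$%A-Z0-9]\s?<", re.IGNORECASE)


def count_floating_chars(html: str) -> int:
    """Count non-overlapping single characters sitting between two tags."""
    return len(FLOATING.findall(html))


def looks_like_floating_tags(html: str, threshold: int = 90) -> bool:
    """Hypothetical check: True when the count reaches the threshold."""
    return count_floating_chars(html) >= threshold


sample = "<DIV> <STRONG> V</STRONG></DIV> <DIV> L</DIV> <DIV> A</DIV>"
print(count_floating_chars(sample))  # 3
```

Because the pattern contains no nested quantifier and no `.*?`, each match attempt is short, so the cost should grow roughly linearly with message size.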

Re: Comment Crashes

Posted by Stuart Johnston <st...@ebby.com>.

Dan wrote:
>> If you could give us a sample of what you are trying to match, maybe 
>> we could suggest an alternate route.
> 
> Stuart,
> 
> It's lines and lines of this kind of thing:
> 
> "> <DIV> <STRONG> V</STRONG></DIV> <DIV> L</DIV> <DIV> A</DIV> <DIV> 
> <STRONG> V</STRONG></DIV> <DIV> P</DIV> <DIV> X</DIV> <DIV> <STRONG> 
> C</STRONG></DIV> </DIV>
> <DIV

I generally find it much easier to match against the text part.  Or do 
your messages not have text parts?

-Stuart

Re: Comment Crashes

Posted by Craig McLean <cr...@fukka.co.uk>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

David B Funk wrote:
> On Tue, 16 May 2006, Craig McLean wrote:
> 
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> [snipped]
>>
>> I use this style to catch a couple of common text formatting oddities
>> caused by machine-generated input, see:
>> http://fukka.co.uk/sa-rules/local/textstyles.cf
>>
>> Thinking about it, this stuff will nest fairly well, so this should work:
>>
>> rawbody T_30_DODGY_DIVS m'(?:<DIV>\s{0,}?[\$%\w]\s{0,}?</DIV>.{1,40}?){30}'i
>>
>> Stick with rawbody, you don't need full. Also, you'll probably want
>> case-insensitive, and \s{0,}? to match zero or more whitespace.
> 
> Only problem with that is "rawbody" processes the original message one
> line at a time, unlike "full" or "body" which concatenate the whole
> message into one large string. So if you're looking for some
> characteristic of a message which is spread across multiple lines of
> input you cannot use "rawbody".

Bugger, you are correct of course. My thanks to you and Sanford Whiteman
 for reminding me that rawbody doesn't (yet) allow multiline matches.

It's 2 AM, I shouldn't be allowed near email :-(

C.

- --
Craig McLean		http://fukka.co.uk
craig@fukka.co.uk	Where the fun never starts
	Powered by FreeBSD, and GIN!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iD8DBQFEaSY5MDDagS2VwJ4RAnDXAJ9IkMhnjIwhhjWad4KfbZWYYxarjACdFccH
/0Fq/bDhx3WUgS5fCwelKk0=
=x5Ln
-----END PGP SIGNATURE-----

Re: Comment Crashes

Posted by David B Funk <db...@engineering.uiowa.edu>.

On Tue, 16 May 2006, Craig McLean wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> [snipped]
>
> I use this style to catch a couple of common text formatting oddities
> caused by machine-generated input, see:
> http://fukka.co.uk/sa-rules/local/textstyles.cf
>
> Thinking about it, this stuff will nest fairly well, so this should work:
>
> rawbody T_30_DODGY_DIVS m'(?:<DIV>\s{0,}?[\$%\w]\s{0,}?</DIV>.{1,40}?){30}'i
>
> Stick with rawbody, you don't need full. Also, you'll probably want
> case-insensitive, and \s{0,}? to match zero or more whitespace.

Only problem with that is "rawbody" processes the original message one
line at a time, unlike "full" or "body" which concatenate the whole
message into one large string. So if you're looking for some
characteristic of a message which is spread across multiple lines of
input you cannot use "rawbody".

Thus you are -very- unlikely to find 30 repetitions of your pattern
in any single line of the input message.

This 'feature' of rawbody has already been the subject of various threads
on this list.

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
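
[Editor's sketch] The difference between line-at-a-time and whole-message matching described above can be illustrated outside SpamAssassin. The Python below is only an analogy, not how SpamAssassin is implemented: `matches_any_line` stands in for the rawbody behaviour as described in this thread, `matches_whole_text` for "full". The pattern is a scaled-down variant of T_30_DODGY_DIVS (four repeats instead of 30, `\s*` instead of `\s{0,}?`), and the two-line message is invented test data.

```python
import re

# Scaled-down stand-in for T_30_DODGY_DIVS: four single-character DIVs,
# each followed by a short gap. re.DOTALL lets "." cross newlines.
PATTERN = re.compile(
    r"(?:<DIV>\s*[$%\w]\s*</DIV>.{1,40}?){4}",
    re.IGNORECASE | re.DOTALL,
)

message = "\n".join([
    "<DIV> V</DIV> <DIV> L</DIV>",
    "<DIV> A</DIV> <DIV> P</DIV> </DIV>",
])


def matches_any_line(text: str) -> bool:
    """Try the pattern against each line separately."""
    return any(PATTERN.search(line) for line in text.split("\n"))


def matches_whole_text(text: str) -> bool:
    """Try the pattern against the message as one string."""
    return PATTERN.search(text) is not None


print(matches_any_line(message))    # False: no single line holds four DIVs
print(matches_whole_text(message))  # True: the four DIVs span two lines
```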

Re: Comment Crashes

Posted by Craig McLean <cr...@fukka.co.uk>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dan wrote:
>> Hmmm, four DIVs, near each other, each with a single alpha and
>> whitespace. May not be what you are trying to catch, but it's the only
>> real pattern I can see from that snippet.
>>
>> rawbody T_4_DODGY_DIVS
>> m'<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>'i
>>
>> describe T_4_DODGY_DIVS Testing...
>> score T_4_DODGY_DIVS    0.01
> 
> Interesting, instead of asking for the count, you are actually showing it
> how many.  Scaled up to 30 and adding space variations, it would look like:
> 
> 
[snipped]

I use this style to catch a couple of common text formatting oddities
caused by machine-generated input, see:
http://fukka.co.uk/sa-rules/local/textstyles.cf

Thinking about it, this stuff will nest fairly well, so this should work:

rawbody T_30_DODGY_DIVS m'(?:<DIV>\s{0,}?[\$%\w]\s{0,}?</DIV>.{1,40}?){30}'i

Stick with rawbody, you don't need full. Also, you'll probably want
case-insensitive, and \s{0,}? to match zero or more whitespace.

C.
- --
Craig McLean		http://fukka.co.uk
craig@fukka.co.uk	Where the fun never starts
	Powered by FreeBSD, and GIN!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iD8DBQFEaQ+fMDDagS2VwJ4RAiJdAKDfS/Nila7mMDnG3FBBQ10gRX0oHQCgiXt9
vzH0Cu0GJrL/Nc5gxJa1D/c=
=Rh9D
-----END PGP SIGNATURE-----

Re: Comment Crashes

Posted by Dan <a...@patnode.net>.

> Hmmm, four DIVs, near each other, each with a single alpha and
> whitespace. May not be what you are trying to catch, but it's the only
> real pattern I can see from that snippet.
>
> rawbody T_4_DODGY_DIVS
> m'<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>'i
> describe T_4_DODGY_DIVS Testing...
> score T_4_DODGY_DIVS    0.01

Interesting, instead of asking for the count, you are actually showing
it how many.  Scaled up to 30 and adding space variations, it would
look like:


full MEGATAGS /<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9] 
\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A- 
Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$% 
A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\ 
$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s? 
[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV> 
\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}? 
<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>. 
{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</ 
DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s? 
</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9] 
\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A- 
Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$% 
A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\ 
$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s? 
[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV>\s?[\$%A-Z0-9]\s?</DIV>.{1,40}?<DIV> 
\s?[\$%A-Z0-9]\s?</DIV>/i
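
[Editor's sketch] A hand-unrolled rule like the one above is easy to get wrong when pasted through a mail client, since line wrapping inserts spaces and breaks. The Python below sketches how the same pattern could be generated, and how a counted non-capturing group expresses the same thing more compactly. `UNIT`, `GAP`, `unrolled` and `quantified` are names made up for this example. Note that this counted form, `UNIT(?:GAP UNIT){n-1}`, differs slightly from the `(?:UNIT GAP){n}` form suggested elsewhere in the thread, which also requires a gap after the final DIV.

```python
import re

UNIT = r"<DIV>\s?[$%A-Z0-9]\s?</DIV>"  # one single-character DIV
GAP = r".{1,40}?"                       # short filler between DIVs
FLAGS = re.IGNORECASE | re.DOTALL


def unrolled(n: int) -> str:
    """Write the unit out n times, joined by gaps (the written-out style)."""
    return GAP.join([UNIT] * n)


def quantified(n: int) -> str:
    """Same pattern, expressed with a counted non-capturing group (n >= 1)."""
    return f"{UNIT}(?:{GAP}{UNIT}){{{n - 1}}}"


thirty_divs = " ".join(["<DIV> X</DIV>"] * 30)
print(bool(re.search(unrolled(30), thirty_divs, FLAGS)))
print(bool(re.search(quantified(30), thirty_divs, FLAGS)))
print(len(unrolled(30)), len(quantified(30)))  # pattern lengths, for comparison
```

Either form can still backtrack heavily on input that almost matches, because each gap of up to 40 characters may also swallow a whole DIV, so this is a readability aid rather than a performance fix.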


> full FloatingTags3 /(?>>\s?[\$%A-Z0-9]\s?<.{,50}?){90}/is

> full FloatingTags3 /(?>>\s?[\$%A-Z0-9]\s?<[^>]{,50}){90}/is

These didn't crash but didn't work either.


> I'm a little confused as to why you're using (?>>...) instead of  
> (?:...)

Sorry for the confusion.  If I understood all of this better, it
would not seem so disjointed.  (?>...) is called an atomic group.
It was another suggestion that I don't understand yet myself.
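
[Editor's sketch] A possible explanation for why the atomic variant stops matching: in `(?>>\s?[\$%A-Z0-9]\s?<.*?)` the group opens with `(?>` and the next `>` is a literal. The lazy `.*?` at the end of the group first matches nothing, and once the atomic group has closed the engine may not go back in to lengthen it. Each repeat therefore has to start immediately after the previous `<`, which the sample HTML never does. The Python below illustrates this with three repeats instead of 90; atomic groups need Python 3.11 or newer, so that part is skipped on older interpreters. The test strings are invented. Separately, and as far as I know, Perl releases before 5.34 treat `{,50}` as literal text rather than a quantifier, which would also keep the two `{,50}` variants quoted above from matching.

```python
import re
import sys

FLAGS = re.IGNORECASE | re.DOTALL
text = "> V</STRONG></DIV> <DIV> L</DIV> <DIV> A<"

# Non-atomic: the lazy ".*?" can be lengthened later, so the gaps
# between floating characters are skipped and three repeats are found.
plain = re.compile(r"(?:>\s?[$%A-Z0-9]\s?<.*?){3}", FLAGS)
print(bool(plain.search(text)))

if sys.version_info >= (3, 11):
    # Atomic: ".*?" stays empty, so repeats must be back to back.
    atomic = re.compile(r"(?>>\s?[$%A-Z0-9]\s?<.*?){3}", FLAGS)
    print(bool(atomic.search(text)))         # gaps can no longer be skipped
    print(bool(atomic.search(">A<>B<>C<")))  # adjacent repeats still match
```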


> body L_drug_float /(:?[PVLXVAC]\s){7}/

This is working and I can't figure out why.    :)
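
[Editor's sketch] A likely reason the rule works: `(:?` looks like a typo for `(?:`. As written it is an ordinary capturing group that begins with an optional colon, and since the colon may be absent it still matches runs of single letters. It probably also helps that `body` rules are matched against the rendered text with the HTML removed, so the letters appear as plain "V L A ..." sequences. The Python below compares the two spellings on invented input.

```python
import re

typo = re.compile(r"(:?[PVLXVAC]\s){7}")      # capturing group, optional leading ":"
intended = re.compile(r"(?:[PVLXVAC]\s){7}")  # non-capturing group

line = "V L A V P X C "

# Both spellings match the same text when no colons are present.
print(typo.search(line).group(0) == intended.search(line).group(0))  # True

# The typo version creates a capture group; the intended one does not.
print(typo.groups, intended.groups)  # 1 0

# Only the typo version tolerates a colon in front of each letter.
with_colons = ":V :L :A :V :P :X :C "
print(bool(typo.search(with_colons)), bool(intended.search(with_colons)))  # True False
```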


Thanks everyone!!

Dan

Re: Comment Crashes

Posted by Craig McLean <cr...@fukka.co.uk>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dan wrote:
>> If you could give us a sample of what you are trying to match, maybe
>> we could suggest an alternate route.
> 
> Stuart,
> 
> It's lines and lines of this kind of thing:
> 
> "> <DIV> <STRONG> V</STRONG></DIV> <DIV> L</DIV> <DIV> A</DIV> <DIV>
> <STRONG> V</STRONG></DIV> <DIV> P</DIV> <DIV> X</DIV> <DIV> <STRONG>
> C</STRONG></DIV> </DIV>
> <DIV
> 
> Dan
> 

Hmmm, four DIVs, near each other, each with a single alpha and
whitespace. May not be what you are trying to catch, but it's the only
real pattern I can see from that snippet.

rawbody T_4_DODGY_DIVS
m'<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>.{1,40}?<DIV>\s+\w</DIV>'i
describe T_4_DODGY_DIVS Testing...
score T_4_DODGY_DIVS    0.01

(note, the regexp should be on one line with no spaces)

That will catch it. You'd have to see what it FPs on though.
You could also get it to pick on single alphas between html tags with a
little tweaking.
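
[Editor's sketch] The T_4_DODGY_DIVS pattern can be checked against Dan's snippet outside SpamAssassin. The Python below rebuilds the same regex from one repeated piece and runs it on the sample joined onto a single line, which is an assumption because the original line wrapping of the message is unknown. `ONE_DIV` is a name introduced for this example only.

```python
import re

# Dan's sample, joined onto one line (assumption: real wrapping unknown).
sample = (
    '"> <DIV> <STRONG> V</STRONG></DIV> <DIV> L</DIV> <DIV> A</DIV> <DIV> '
    "<STRONG> V</STRONG></DIV> <DIV> P</DIV> <DIV> X</DIV> <DIV> <STRONG> "
    "C</STRONG></DIV> </DIV>"
)

ONE_DIV = r"<DIV>\s+\w</DIV>"
# Four single-character DIVs, each pair separated by 1 to 40 characters.
T_4_DODGY_DIVS = re.compile(r".{1,40}?".join([ONE_DIV] * 4), re.IGNORECASE)

match = T_4_DODGY_DIVS.search(sample)
print(match.group(0) if match else "no match")
```

On this input the match runs from the "L" DIV to the "X" DIV. The letters wrapped in STRONG are not counted as DIVs by this pattern and are only bridged by the 40-character gap, which is where the tweaking mentioned above would come in.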

C.

- --
Craig McLean		http://fukka.co.uk
craig@fukka.co.uk	Where the fun never starts
	Powered by FreeBSD, and GIN!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iD8DBQFEaPXUMDDagS2VwJ4RAgjdAJ9Uv7TmKzEeE4ee8zh51r7J8UFbvwCgywG0
ZGaVPYHX6X9+e5e5+fUGDFM=
=/hQ0
-----END PGP SIGNATURE-----

Re: Comment Crashes

Posted by Dan <a...@patnode.net>.

> If you could give us a sample of what you are trying to match,  
> maybe we could suggest an alternate route.

Stuart,

It's lines and lines of this kind of thing:

"> <DIV> <STRONG> V</STRONG></DIV> <DIV> L</DIV> <DIV> A</DIV> <DIV>  
<STRONG> V</STRONG></DIV> <DIV> P</DIV> <DIV> X</DIV> <DIV> <STRONG>  
C</STRONG></DIV> </DIV>
<DIV

Dan

Re: Comment Crashes

Posted by Stuart Johnston <st...@ebby.com>.

Dan wrote:
> I'm running into more comment counting problems:
> 
> 
> This crashes SA:
> full FloatingTags1 /(>\s?[\$%A-Z0-9]\s?<.*?){90,}/is
> 
> 
> This does not:
> full FloatingTags2 /(>\s?[\$%A-Z0-9]\s?<.*?){30,}/is
> 
> 
> While this doesn't crash, it also doesn't work:
> full FloatingTags3 /(?>>\s?[\$%A-Z0-9]\s?<.*?){90,}/is
> 
> 
> Based on Matt's recent comments:
> 
>> Yes, but across the entire message body using .* in a rule is REALLY slow.
>>
>> I didn't say that counting was impossible with rules, I said it is not 
>> good at it.
> 
>> Counting occurrences of something across the entire body of the message
>> is not something SA is good at with just rules. You'd need a plugin to
>> do it.
> 
> My premise is wrong.  Do I just need to give up on regex for this and find
> a way to eval it (I haven't learned Perl yet!)?

If you could give us a sample of what you are trying to match, maybe we 
could suggest an alternate route.

-Stuart