Posted to users@spamassassin.apache.org by Kris Deugau <kd...@vianet.ca> on 2012/04/02 18:40:27 UTC

Regex help (targetting very long HTML comments)

Can anyone point out what bit of stupidity I'm committing in trying to 
use this:

rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s

to match messages that are mostly very very long HTML comment(s)?

Testing the same regex against the whole raw message outside of SA seems 
to fire just fine.

-kgd

Re: Regex help (targetting very long HTML comments)

Posted by Henrik K <he...@hege.li>.

On Tue, Apr 03, 2012 at 11:00:56PM +0300, Henrik K wrote:
> On Mon, Apr 02, 2012 at 12:40:27PM -0400, Kris Deugau wrote:
> > Can anyone point out what bit of stupidity I'm committing in trying
> > to use this:
> > 
> > rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s
> > 
> > to match messages that are mostly very very long HTML comment(s)?
> > 
> > Testing the same regex against the whole raw message outside of SA
> > seems to fire just fine.
> 
> HTML parser already has all the information needed. Simply use the existing
> HTMLEval method:
> 
> body OVERSIZE_COMMENT eval:html_text_match('comment', '(?s)^(?=.{32000})')
> 
> (?s) to enable single-line mode
> (?=) lookahead to prevent SA storing the match result (save memory :p)
> 
> This only checks the "main" message body that SA uses. If you want to check
> _all_ mime parts, here's a quick plugin:
> 
> http://sa.hege.li/HTMLComments.pm

PS. Learn something new every day... it seems perlre quantifiers can't be
bigger than 32766. To test anything bigger you need some hack like:
(?=(?:.{1000}){50})
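
A minimal sketch of the nested-quantifier workaround above, assuming a Perl whose single {n} count is capped as described (the exact cap depends on the Perl version). The helper name at_least_50k is hypothetical and only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string is at least 50,000 characters long.
# 1000 x 50 = 50,000, and neither count comes near the quantifier limit.
sub at_least_50k {
    my ($text) = @_;
    return $text =~ /^(?=(?:.{1000}){50})/s ? 1 : 0;
}

print at_least_50k('x' x 50_000) ? "match\n" : "no match\n";   # exactly 50,000 chars
print at_least_50k('x' x 49_999) ? "match\n" : "no match\n";   # one char short
```

Because both counts are exact, the lookahead either finds 50 full blocks of 1000 characters or fails outright.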

Re: Regex help (targetting very long HTML comments)

Posted by Henrik K <he...@hege.li>.

On Tue, Apr 03, 2012 at 05:25:57PM -0400, Kris Deugau wrote:
> Henrik K wrote:
> >This only checks the "main" message body that SA uses. If you want to check
> >_all_ mime parts, here's a quick plugin:
> >
> >http://sa.hege.li/HTMLComments.pm
> 
> Hm.  Does check_html_comment_length get each tag all by itself?
> Otherwise it looks like the regex in your while() will match a
> message with a short opening comment, $find_len of miscellaneous
> content or HTML tags, and a short closing comment.

If you look at it, it's pretty clear.

<!--(.*?)-->

- Find opening tag <!--
- Stop at first closing tag --> found (non-greedy regex .*?)

We capture the contents to check the length. If it's shorter than wanted,
just find the next comment block.  In my tests this works slightly faster than
your complex regex which tries to directly find long matches.

Or did you mean something else?
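
The capture-and-measure loop described above could be sketched roughly as follows. This is not the plugin's actual code: has_long_comment and its parameters are hypothetical names, and the function works on a plain string rather than on SpamAssassin's message parts.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any single HTML comment body in $html is at
# least $min_len characters long.
sub has_long_comment {
    my ($html, $min_len) = @_;
    # Non-greedy .*? stops at the first closer, so each iteration sees
    # exactly one comment; /g resumes the search after the previous match.
    while ($html =~ /<!--(.*?)-->/gs) {
        return 1 if length($1) >= $min_len;
    }
    return 0;
}

# Two short comments with a lot of ordinary content in between: no hit.
my $mixed = '<!-- a --><p>' . ('x' x 500) . '</p><!-- b -->';
# One comment whose body alone is 500 characters: hit.
my $bloated = '<!--' . ('y' x 500) . '-->';

print has_long_comment($mixed,   100), "\n";   # 0
print has_long_comment($bloated, 100), "\n";   # 1
```

Since the length test is applied to each captured body separately, content between two short comments never counts towards the threshold.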

Re: Regex help (targetting very long HTML comments)

Posted by Kris Deugau <kd...@vianet.ca>.

Henrik K wrote:
> On Mon, Apr 02, 2012 at 12:40:27PM -0400, Kris Deugau wrote:
>> Can anyone point out what bit of stupidity I'm committing in trying
>> to use this:
>>
>> rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s
>>
>> to match messages that are mostly very very long HTML comment(s)?
>>
>> Testing the same regex against the whole raw message outside of SA
>> seems to fire just fine.
>
> HTML parser already has all the information needed. Simply use the existing
> HTMLEval method:
>
> body OVERSIZE_COMMENT eval:html_text_match('comment', '(?s)^(?=.{32000})')

Interesting!  I'll try that out too.

> This only checks the "main" message body that SA uses. If you want to check
> _all_ mime parts, here's a quick plugin:
>
> http://sa.hege.li/HTMLComments.pm

Hm.  Does check_html_comment_length get each tag all by itself? 
Otherwise it looks like the regex in your while() will match a message 
with a short opening comment, $find_len of miscellaneous content or HTML 
tags, and a short closing comment.

-kgd

Re: Regex help (targetting very long HTML comments)

Posted by Henrik K <he...@hege.li>.

On Mon, Apr 02, 2012 at 12:40:27PM -0400, Kris Deugau wrote:
> Can anyone point out what bit of stupidity I'm committing in trying
> to use this:
> 
> rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s
> 
> to match messages that are mostly very very long HTML comment(s)?
> 
> Testing the same regex against the whole raw message outside of SA
> seems to fire just fine.

HTML parser already has all the information needed. Simply use the existing
HTMLEval method:

body OVERSIZE_COMMENT eval:html_text_match('comment', '(?s)^(?=.{32000})')

(?s) to enable single-line mode
(?=) lookahead to prevent SA storing the match result (save memory :p)
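
To illustrate the two notes above outside of SA, here is a small sketch comparing the lookahead form with a plain consuming pattern. The helper matched_length is a hypothetical name; how much memory SA itself saves is not measured here.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the text the pattern actually matched,
# or undef if the pattern did not match at all.
sub matched_length {
    my ($text, $re) = @_;
    return $text =~ $re ? $+[0] - $-[0] : undef;
}

my $comment = "x\n" x 20_000;    # 40,000 characters, newlines included

# Lookahead: asserts that 32000 characters follow, but matches nothing.
print matched_length($comment, qr/(?s)^(?=.{32000})/), "\n";   # 0
# Without the lookahead the same test matches 32000 characters.
print matched_length($comment, qr/(?s)^.{32000}/), "\n";       # 32000
```

Without (?s) the dot would stop at the first newline, so the test string would never reach 32000 characters.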

This only checks the "main" message body that SA uses. If you want to check
_all_ mime parts, here's a quick plugin:

http://sa.hege.li/HTMLComments.pm

Re: Regex help (targetting very long HTML comments)

Posted by Kris Deugau <kd...@vianet.ca>.

Bowie Bailey wrote:
> Try using a string that's longer than 320 characters that starts with a
> short comment.
>
> i.e.:    '<!-- comment -->  blah blah blah blah.....'
>
> This is where your original version will fail.  Your original regex
> translates as "a string starting with a comment opener followed by at
> least 32000 characters that do not start with a comment closer".  So a
> long string that starts with a short comment will match your original
> regexp.  I confirmed this by running your code above and moving the
> comment closer from the end to just after the first "foo".

Ah, right, OK.  The example in the Camel Book isn't very clear on 
exactly how the condition attaches to either the regex as a whole, or 
any particular part of it.

>> None of the variants seem to be *too* nasty on the CPU though;  feeding
>> one of these monster messages through a minimal Perl script as above
>> that just runs a handful of regexes showed:
>>
>> real    0m0.050s
>> user    0m0.045s
>> sys     0m0.012s
>
> That doesn't look too bad.  I compared the two variants on my own with a
> large test string (over 32000 chars) and found that the extra
> look-aheads in the working regexp took my case from 26ms to 36ms.
> Probably not enough to cause a problem, but definitely significant.
> However, this only occurs when there is a huge comment.  If the comment
> is small, both versions run the same, so you are probably ok as far as
> that goes.

It's probably a lot nastier on large *legitimate* messages with many 
(small) HTML comments, but those already take a long time to scan anyway 
and the best thing I can do about them is whitelist or blacklist them 
upstream of SA (depending on user preference).

Closer inspection of one of these spams showed it was actually several 
very long HTML comments in between the actual content tags - all four or 
five of them.  Stripping the comments trims it down to less than 1K - 
essentially just a couple of <img> tags pointing to remote servers for 
the actual spam payload images.

-kgd

Re: Regex help (targetting very long HTML comments)

Posted by "David F. Skoll" <df...@roaringpenguin.com>.

[Somewhat OT]

In general, I would be very wary of any regex that has an unbounded
quantifier like +, * or {32000,}

If all you care about is matching something followed by *at least* 32000
copies of something else, you should use:

       /something(?:something_else){32000}/

After all, once you see 32000 of them, you don't care if there could be
10 million more of them.  When you craft regexes, you should always consider
their behaviour on pathological cases like 50MB text emails.

Another option is to use non-greedy quantifiers, but I prefer upper bounds
on quantifiers.

Regards,

David.
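
A small sketch of the difference between the bounded and unbounded forms, using the offset where each match ends as a rough stand-in for how far the engine scanned. The helper match_end is a hypothetical name, and the single-character patterns are placeholders for "something" and "something_else".

```perl
use strict;
use warnings;

# Hypothetical helper: offset at which the match ends, or undef on no match.
sub match_end {
    my ($text, $re) = @_;
    return $text =~ $re ? $+[0] : undef;
}

my $text = 'a' . ('b' x 100_000);

print match_end($text, qr/a(?:b){32000,}/), "\n";   # 100001: greedy, runs to the end
print match_end($text, qr/a(?:b){32000}/),  "\n";   # 32001: stops once the count is met
```

Both patterns agree on whether the string matches; only the bounded one stops reading as soon as the answer is known.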

Re: Regex help (targetting very long HTML comments)

Posted by Bowie Bailey <Bo...@BUC.com>.

On 4/2/2012 6:03 PM, Kris Deugau wrote:
>> On 4/2/2012 12:58 PM, Stephane Chazelas wrote:
>>> Don't know about the spamassassin issue, but that regexp
>>> matches <!-- followed by a sequence of 32000 or more characters
>>> provided that sequence doesn't start with "-->".
>>>
>>> ITYM
>>>
>>> m|<!--(?:(?!-->).){32000,}|s
>>>
>>> That is you need to look ahead at each character of the sequence
>>> to look for the closing comment tag, otherwise you'll match on
>>> <!-- short comment -->  <31982 or more characters>
> Actually, no, it works as intended.
>
> If you uncomment the string fragments below, the 320-character versions 
> both match but the 32000-character ones don't.  As-is, neither matches.
>
> my $shorty = "<!-- foo bar baxkja safdjwelkj werf kjwlekrjwlekr jlwkerk 
> jawelkj awlekj lakewjflakwjef lakj ".
> #"awelkj alkfj awlekfj lawie fjalwief jlawijfe lawiejflfiwj elifj 
> lawiej4lti j34wlit j43wli jliajs lij flisaj ".
> #"flsaidfj liasjdf lisdj lijsa fldi fa;slkjf;lask j;lkaj fs; jfsdjf sak 
> hflkshf lksj fhlksaj fhlska fhlkajs ".
> "fhlkajshlkjashflkjasdfhlkjsahdflkjas hlkfh lwelif hwli3u fhliuwae 
> fhliuawfheliuhfliu fhwei ufhsd fg/sd ".
> "/dsf/g/sdafg /sdf/ 
> gdf/sg/sdf/gds/g/sd/th/ser/h/ser/ghs/rg/srg/ser/gs/erg/ser/g/ser/g/ser/g/ser/g/serg 
> -->";
>
> my @regex = ('<!--(?:(?!-->).){32000,}', '<!--(?:(?!-->).){320,}', 
> '<!--(?!-->).{32000,}', '<!--(?!-->).{320,}');
>
> foreach (@regex) {
>    print "$_ shorty ok\n" if $shorty =~ m/$_/s;
> }
>
> (And yes, this is almost exactly what I'm seeing in these monster 
> comments, although they're usually at least mostly real words, and they 
> are in the ~100K+-characters length range.)

Try using a string that's longer than 320 characters that starts with a
short comment.

i.e.:    '<!-- comment --> blah blah blah blah.....'

This is where your original version will fail.  Your original regex
translates as "a string starting with a comment opener followed by at
least 32000 characters that do not start with a comment closer".  So a
long string that starts with a short comment will match your original
regexp.  I confirmed this by running your code above and moving the
comment closer from the end to just after the first "foo".

Output:
<!--(?!-->).{320,} shorty ok

> Bowie Bailey wrote:
>> And you may or may not want to match on a closing comment at the end.
>>
>> m|<!--(?:(?!-->).){32000,}-->|s
> Enh, I don't think it matters.

Maybe not, but I like to be as specific as possible in a regexp.  If it
doesn't end with a comment closer, is it really a comment?

> However, when testing in a minimal Perl script that just tries to match 
> on the whole raw message, my original works fine;  I don't need the 
> extra non-capturing parentheses.

See my comment above.  I don't think you've tested it properly yet.

>
>> Also, because of all of the lookaheads, this may be an expensive
>> regexp.  If you try it, keep a close eye on your SA.  If it slows down
>> to a crawl, this is probably the culprit.
> None of the variants seem to be *too* nasty on the CPU though;  feeding 
> one of these monster messages through a minimal Perl script as above 
> that just runs a handful of regexes showed:
>
> real    0m0.050s
> user    0m0.045s
> sys     0m0.012s

That doesn't look too bad.  I compared the two variants on my own with a
large test string (over 32000 chars) and found that the extra
look-aheads in the working regexp took my case from 26ms to 36ms. 
Probably not enough to cause a problem, but definitely significant. 
However, this only occurs when there is a huge comment.  If the comment
is small, both versions run the same, so you are probably ok as far as
that goes.

-- 
Bowie

Re: Regex help (targetting very long HTML comments)

Posted by Kris Deugau <kd...@vianet.ca>.

>> 2012-04-02 12:40:27 -0400, Kris Deugau:
>>> Can anyone point out what bit of stupidity I'm committing in trying
>>> to use this:
>>>
>>> rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s
>>>
>>> to match messages that are mostly very very long HTML comment(s)?

I've found one way to handle this;  use "full" instead of "rawbody". 
IIRC there is still some chunkifying done to rawbody, so nothing will 
ever match 32K characters of what's provided for rawbody rules.  IIRC 
the limit is somewhere between 2-3K.

> On 4/2/2012 12:58 PM, Stephane Chazelas wrote:
>> Don't know about the spamassassin issue, but that regexp
>> matches <!-- followed by a sequence of 32000 or more characters
>> provided that sequence doesn't start with "-->".
>>
>> ITYM
>>
>> m|<!--(?:(?!-->).){32000,}|s
>>
>> That is you need to look ahead at each character of the sequence
>> to look for the closing comment tag, otherwise you'll match on
>> <!-- short comment -->  <31982 or more characters>

Actually, no, it works as intended.

If you uncomment the string fragments below, the 320-character versions 
both match but the 32000-character ones don't.  As-is, neither matches.

my $shorty = "<!-- foo bar baxkja safdjwelkj werf kjwlekrjwlekr jlwkerk 
jawelkj awlekj lakewjflakwjef lakj ".
#"awelkj alkfj awlekfj lawie fjalwief jlawijfe lawiejflfiwj elifj 
lawiej4lti j34wlit j43wli jliajs lij flisaj ".
#"flsaidfj liasjdf lisdj lijsa fldi fa;slkjf;lask j;lkaj fs; jfsdjf sak 
hflkshf lksj fhlksaj fhlska fhlkajs ".
"fhlkajshlkjashflkjasdfhlkjsahdflkjas hlkfh lwelif hwli3u fhliuwae 
fhliuawfheliuhfliu fhwei ufhsd fg/sd ".
"/dsf/g/sdafg /sdf/ 
gdf/sg/sdf/gds/g/sd/th/ser/h/ser/ghs/rg/srg/ser/gs/erg/ser/g/ser/g/ser/g/ser/g/serg 
-->";

my @regex = ('<!--(?:(?!-->).){32000,}', '<!--(?:(?!-->).){320,}', 
'<!--(?!-->).{32000,}', '<!--(?!-->).{320,}');

foreach (@regex) {
   print "$_ shorty ok\n" if $shorty =~ m/$_/s;
}

(And yes, this is almost exactly what I'm seeing in these monster 
comments, although they're usually at least mostly real words, and they 
are in the ~100K+-characters length range.)

Bowie Bailey wrote:
> And you may or may not want to match on a closing comment at the end.
>
> m|<!--(?:(?!-->).){32000,}-->|s

Enh, I don't think it matters.

However, when testing in a minimal Perl script that just tries to match 
on the whole raw message, my original works fine;  I don't need the 
extra non-capturing parentheses.

> Also, because of all of the lookaheads, this may be an expensive
> regexp.  If you try it, keep a close eye on your SA.  If it slows down
> to a crawl, this is probably the culprit.

None of the variants seem to be *too* nasty on the CPU though;  feeding 
one of these monster messages through a minimal Perl script as above 
that just runs a handful of regexes showed:

real    0m0.050s
user    0m0.045s
sys     0m0.012s

-kgd

Re: Regex help (targetting very long HTML comments)

Posted by Bowie Bailey <Bo...@BUC.com>.

On 4/2/2012 12:58 PM, Stephane Chazelas wrote:
> 2012-04-02 12:40:27 -0400, Kris Deugau:
>> Can anyone point out what bit of stupidity I'm committing in trying
>> to use this:
>>
>> rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s
>>
>> to match messages that are mostly very very long HTML comment(s)?
>>
>> Testing the same regex against the whole raw message outside of SA
>> seems to fire just fine.
> [...]
>
> Don't know about the spamassassin issue, but that regexp
> matches <!-- followed by a sequence of 32000 or more characters
> provided that sequence doesn't start with "-->".
>
> ITYM
>
> m|<!--(?:(?!-->).){32000,}|s
>
> That is you need to look ahead at each character of the sequence
> to look for the closing comment tag, otherwise you'll match on
> <!-- short comment --> <31982 or more characters>

And you may or may not want to match on a closing comment at the end.

m|<!--(?:(?!-->).){32000,}-->|s

Also, because of all of the lookaheads, this may be an expensive
regexp.  If you try it, keep a close eye on your SA.  If it slows down
to a crawl, this is probably the culprit.

-- 
Bowie
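
A brief sketch of what requiring the closer changes, assuming a threshold of 320 instead of 32000 to keep the test strings small. The variable and helper names are hypothetical.

```perl
use strict;
use warnings;

my $min = 320;   # small threshold for illustration only

my $open_only = qr/<!--(?:(?!-->).){${min},}/s;      # closer not required
my $closed    = qr/<!--(?:(?!-->).){${min},}-->/s;   # closer required

# Hypothetical helper: 1 if the pattern matches, 0 otherwise.
sub hits {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

my $unterminated = '<!--' . ('x' x 400);    # opener, never closed
my $terminated   = $unterminated . '-->';

print hits($unterminated, $open_only), hits($unterminated, $closed), "\n";   # 10
print hits($terminated,   $open_only), hits($terminated,   $closed), "\n";   # 11
```

Whether an unterminated opener should count as a comment is the judgment call discussed above; the sketch only shows that the two patterns disagree on it.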

Re: Regex help (targetting very long HTML comments)

Posted by Stephane Chazelas <st...@gmail.com>.

2012-04-02 12:40:27 -0400, Kris Deugau:
> Can anyone point out what bit of stupidity I'm committing in trying
> to use this:
> 
> rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s
> 
> to match messages that are mostly very very long HTML comment(s)?
> 
> Testing the same regex against the whole raw message outside of SA
> seems to fire just fine.
[...]

Don't know about the spamassassin issue, but that regexp
matches <!-- followed by a sequence of 32000 or more characters
provided that sequence doesn't start with "-->".

ITYM

m|<!--(?:(?!-->).){32000,}|s

That is you need to look ahead at each character of the sequence
to look for the closing comment tag, otherwise you'll match on
<!-- short comment --> <31982 or more characters>

-- 
Stephane
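
The difference between the two patterns can be sketched in a few lines of Perl. The threshold is lowered to 320 to keep the strings small, and the variable and helper names are hypothetical.

```perl
use strict;
use warnings;

my $min = 320;   # small threshold for illustration only

# Lookahead checked once, immediately after the opener.
my $once     = qr/<!--(?!-->).{${min},}/s;
# Lookahead checked before every character of the comment body.
my $per_char = qr/<!--(?:(?!-->).){${min},}/s;

# Hypothetical helper: 1 if the pattern matches, 0 otherwise.
sub fires {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

my $short_then_text = '<!-- short comment -->' . ('x' x 400);
my $long_comment    = '<!--' . ('x' x 400) . '-->';

print fires($short_then_text, $once), fires($short_then_text, $per_char), "\n";   # 10
print fires($long_comment,    $once), fires($long_comment,    $per_char), "\n";   # 11
```

The first pattern fires on a short comment followed by ordinary text, because nothing stops .{320,} from running past the closer; the second cannot step over "-->".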

Re: Regex help (targetting very long HTML comments)

Posted by Kris Deugau <kd...@vianet.ca>.

Adam Katz wrote:
> % grep html_text_match..comment 20_html_tests.cf

I hadn't known about that function until I saw Henrik's replies last 
week, so it would have been hard to search for it.

> Any more than 512 chars isn't going to be helpful but will end up being
> computationally expensive (I've played with this idea).  Also, I'd say
> this is more of a ham indicator than a spam indicator.

*shrug*  I happen to be getting a wave of ~400K spams that consist of 
about 1K of real HTML tags, loading the spam content via image from a 
remote server, with the remainder of that 400K message consisting of 
maybe four *very* long HTML comments (50K+) with nothing but gibberish 
(groups of ~4-8 words, separated by /, ;, # and occasionally some other 
symbol).

I've also seen gobs of mail with ~5K of CSS in an HTML comment - mostly 
from Outlook.  *eyeroll*

These are most of what's still getting through to *my* inbox, but with 
~50K users I'd assume they're hitting other people as well. 
Unfortunately, as an ISP sysadmin, my ability to get useful, timely 
feedback from a high proportion of the userbase is...   limited.

-kgd

Re: Regex help (targetting very long HTML comments)

Posted by Henrik K <he...@hege.li>.

On Fri, Apr 06, 2012 at 07:07:18PM +0300, Henrik K wrote:
> On Fri, Apr 06, 2012 at 08:40:08AM -0700, Adam Katz wrote:
> > 
> > Try this:
> > 
> > body OVERSIZE_COMMENT  eval:html_text_match('comment',
> > '<!--(?!.?-->).{512,}-->')
> 
> No. See what I already posted.

Btw, I put a few test rules in my sandbox:

http://ruleqa.spamassassin.org/?rule=%2F__HTML_COMMENT

Not many hits. Then again the corpus is very small these days.

Re: Regex help (targetting very long HTML comments)

Posted by Henrik K <he...@hege.li>.

On Fri, Apr 06, 2012 at 08:40:08AM -0700, Adam Katz wrote:
> 
> Try this:
> 
> body OVERSIZE_COMMENT  eval:html_text_match('comment',
> '<!--(?!.?-->).{512,}-->')

No. See what I already posted.

Re: Regex help (targetting very long HTML comments)

Posted by Adam Katz <an...@khopis.com>.

On 04/02/2012 09:40 AM, Kris Deugau wrote:
> Can anyone point out what bit of stupidity I'm committing in trying
> to use this:
> 
> rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s
> 
> to match messages that are mostly very very long HTML comment(s)?
> 
> Testing the same regex against the whole raw message outside of SA
> seems to fire just fine.

There are already a few rules that do this sort of thing.  Use them as
models:

% grep html_text_match..comment 20_html_tests.cf
body HTML_COMMENT_SHORT eval:html_text_match('comment', '<!(?!-).{0,6}>')
body HTML_COMMENT_SAVED_URL eval:html_text_match('comment', '<!-- saved
from url=\(\d{4}\)')
body __COMMENT_EXISTS eval:html_text_match('comment', '<!.*?>')

Try this:

body OVERSIZE_COMMENT  eval:html_text_match('comment',
'<!--(?!.?-->).{512,}-->')

Any more than 512 chars isn't going to be helpful but will end up being
computationally expensive (I've played with this idea).  Also, I'd say
this is more of a ham indicator than a spam indicator.