Posted to users@spamassassin.apache.org by Kris Deugau <kd...@vianet.ca> on 2011/05/26 21:02:37 UTC

Large (usually legitimate) HTML mails choking SA

Every so often we get a message or two stuck in our inbound mail queue 
because it took too long for SA to process during mail delivery.

For a little while there were actually runs of pure HTML-garbage spam 
over 500K;  those have been dealt with (and may have disappeared).

However, we've just had a couple of *legitimate* messages get stuck for 
essentially the same reason - a whole lot of pathologically bad HTML. 
(One looks like it was generated by Word, then converted to email by an 
MS mail library used by a third-party SMTP mailer.  There's maybe 5K 
worth of actual useful HTML in about 100K of QP-encoded HTML.  The 
others appear to be investment-related, where the sender has included 
the content as a base64-encoded .html attachment.)

Whitelisting these once they're found lets them bypass SA altogether, 
but in the meantime they get stuck in the mail queue.

Has anyone got any suggestions for decreasing the load SA imposes trying 
to process one of these?

-kgd

Re: Large (usually legitimate) HTML mails choking SA

Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2011-05-26 at 16:37 -0400, Alex wrote:
> Any tips on how to do that? When used in conjunction with amavis, is
> there a way to identify which rule consumes the most processing time,
> in the same way it can for bayes or SA overall?
> 
By inspection, e.g. any rawbody rule whose regex contains .* is an
immediate suspect. 

These days I tend not to write any rule with an unbounded match. As an
example, instead of "string1.*(string2|string3)" I'll write that part of
the rule as "string1.{0,n}(string2|string3)" because, unlike the
unbounded .*, this cannot match a huge span of text and then have to
backtrack miles before trying subsequent alternates.
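Spelled out as a rule pair (the rule names and the bound of 60 are
illustrative, not from the thread):

```
# Unbounded: .* may swallow a huge span of text, and a failed match
# backtracks across all of it before each alternative is retried.
body   DEMO_UNBOUNDED  /string1.*(?:string2|string3)/

# Bounded: a failed match backtracks over at most 60 characters.
body   DEMO_BOUNDED    /string1.{0,60}(?:string2|string3)/
```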


Martin



Re: Large (usually legitimate) HTML mails choking SA

Posted by Alex <my...@gmail.com>.
Hi,

>> Every so often we get a message or two stuck in our inbound mail queue
>> because it took too long for SA to process during mail delivery.
>
>> However, we've just had a couple of *legitimate* messages get stuck for
>> essentially the same reason - a whole lot of pathologically bad HTML.
>
> Rings a bell. Such reports usually turned out to be caused by custom
> rules. Any custom rawbody rules, in particular ones matching HTML tags,
> or otherwise prone to trigger RE backtracking? (That is, may consume
> large sub-strings, before a following sub-pattern.)
>
>> Has anyone got any suggestions for decreasing the load SA imposes trying
>> to process one of these?
>
> Identify the bad boy. :)

Any tips on how to do that? When used in conjunction with amavis, is
there a way to identify which rule consumes the most processing time,
in the same way it can for bayes or SA overall?

Thanks,
Alex

Re: Large (usually legitimate) HTML mails choking SA

Posted by da...@chaosreigns.com.
On 05/27, John Hardin wrote:
> Yes. "*" is "zero or more, unbounded" and "+" is "one or more, unbounded".
> 
> It's much better to have an upper limit in body and rawbody rules,
> e.g. {0,80} or {1,80}
> 
> The upper limit may need some experimentation to set in specific
> cases, but even so, {0,255} can be much less painful than *.

So somebody should (open a bug to) go through all the rules we provide
and replace all instances of "*" with {0,255} and "+" with {1,255}?

> Header and URI texts are inherently fairly short so it's safer to
> use unbounded matches against them, but even so it's a good idea to

But still vulnerable to regex DoS....

-- 
"I don't want to die... just yet... not while there's... women."
- J. Matthew Root, 8/23/02 (http://www.jmrart.com/)
http://www.ChaosReigns.com

Re: Large (usually legitimate) HTML mails choking SA

Posted by John Hardin <jh...@impsec.org>.
On Fri, 27 May 2011, Kris Deugau wrote:

> I have a couple of instances of [a-z]+ and similar;  is that effectively as 
> troublesome as .+ or .*?

Yes. "*" is "zero or more, unbounded" and "+" is "one or more, unbounded".

It's much better to have an upper limit in body and rawbody rules, e.g. 
{0,80} or {1,80}

The upper limit may need some experimentation to set in specific cases, 
but even so, {0,255} can be much less painful than *.

Header and URI texts are inherently fairly short so it's safer to use 
unbounded matches against them, but even so it's a good idea to simply get 
in the habit of always using bounded matches when writing rules.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   How can you reason with someone who thinks we're on a glidepath to
   a police state and yet their solution is to grant the government a
   monopoly on force? They are insane.
-----------------------------------------------------------------------
  3 days until Memorial Day - honor those who sacrificed for our liberty

Re: Large (usually legitimate) HTML mails choking SA

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Fri, 27 May 2011 10:38:17 -0400
Kris Deugau <kd...@vianet.ca> wrote:

> I have a couple of instances of [a-z]+ and similar;  is that
> effectively as troublesome as .+ or .*?

It could be, depending on what else is in the regex.  There's a fairly
nice Wikipedia article about evil regexes:

http://en.wikipedia.org/wiki/ReDoS#Evil_regexes

When I write SA rules, I never use the * or + operators.  I always
use something like {0,40} or {1,40} just to be on the safe side.

(That still does not eliminate the possibility of exponential behaviour
from bad regexes, but it does offer some protection against bad behaviour
on unfortunate input strings.)

Regards,

David.
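The flavor of failure the Wikipedia article describes can be reproduced in
a few lines; Python's `re` module backtracks much like Perl's engine, and
the pattern and input sizes here are only illustrative:

```python
import re
import time

def timed(pattern, s):
    """Return (matched?, seconds) for one search with the given pattern."""
    t0 = time.perf_counter()
    m = re.search(pattern, s)
    return m is not None, time.perf_counter() - t0

# Near-miss input: all 'a's except the last character, so matching must fail.
s = 'a' * 20 + 'b'

# Evil form from the article: nested unbounded quantifiers over the same
# character give roughly 2^n ways to partition the run of 'a's, and the
# engine tries them all before giving up.
slow_hit, t_slow = timed(r'^(?:a+)+$', s)

# The nesting adds nothing to the language matched; a single a+ is linear.
fast_hit, t_fast = timed(r'^a+$', s)

print(slow_hit, fast_hit)   # False False -- same verdict either way
print(t_fast < t_slow)      # True -- the nested form is dramatically slower
```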

Re: Large (usually legitimate) HTML mails choking SA

Posted by Kris Deugau <kd...@vianet.ca>.
Karsten Bräckelmann wrote:
> However, using (?:\s|\&nbsp;)* also does the trick. Yes, keeping the
> nasty asterisk quantifier. The difference is merely dropping the \n from
> the alternation, which is part of \s whitespace anyway.
>
> Wondering if this is a case where Perl fails to optimize out the \n.
> Which would result in an alternation with overlap...

Hmm.  This may be a Perl-version-specific (or 
which-flags-Perl-was-built-with thing) then, because I've been adding \n 
on rawbody rules where I want to match multiple physical lines because 
\s *hasn't* been matching newlines - at least, not all the time.

-kgd

Re: Large (usually legitimate) HTML mails choking SA

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2011-05-27 at 13:14 -0400, Kris Deugau wrote:
> Karsten Bräckelmann wrote:

> > Yes, that sounds like the culprit indeed is one or more custom rule. If
> > that "much faster" equals twice as fast,
> 
> Probably closer to 4-6x;  dual PIII/866 -> Core i3 3GHz.

Sure -- that "twice" assumption was just a quickly assumed lower bound
that still shows the dramatic difference of the custom rule burning a
whopping 25 times the CPU.

> > Bisection is your friend.
> >
> > Go hunt down that bugger, that in conjunction with the specific sample
> > kills your performance. Once you found it, maybe you can post it?
> 
> Seems to have been this:
> 
> rawbody TOO_MANY_DIVS	/(?:<[Dd][Ii][Vv]>(?:\s|\n|\&nbsp\;)*){6}/

Aha! Yes, that nesting of quantifiers sure looks like a prime candidate.
Even though this isn't the pure evil form -- which would be to have two
alternatives with overlap in sub-patterns.

Or maybe it is. Frankly, not sure what exactly causes the RE to go
berserk.

> Changing the * to {,100} drops the processing time down to ~8s.

Confirmed, grabbed your sample and this eliminates the issue.

However, using (?:\s|\&nbsp;)* also does the trick. Yes, keeping the
nasty asterisk quantifier. The difference is merely dropping the \n from
the alternation, which is part of \s whitespace anyway.

Wondering if this is a case where Perl fails to optimize out the \n.
Which would result in an alternation with overlap...
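The overlap hypothesis can be checked with a small timing sketch (Python's
backtracking `re` engine behaves similarly to Perl's here; the sample and
run lengths are illustrative, kept small so the slow case finishes):

```python
import re
import time

def timed(pattern, s):
    """Return (matched?, seconds) for one search with the given pattern."""
    t0 = time.perf_counter()
    m = re.search(pattern, s)
    return m is not None, time.perf_counter() - t0

# Five <div> tags -- one short of the six the rule wants -- separated by
# newline runs, so the overall match fails and the engine must backtrack.
sample = '<div>' + '\n' * 14 + ('<div>' + '\n') * 4

# Original alternation: \n overlaps \s, so every newline in a run can be
# consumed by either branch, multiplying the backtracking paths.
overlap, t_overlap = timed(r'(?:<div>(?:\s|\n|&nbsp;)*){6}', sample)

# Same rule with the redundant \n dropped from the alternation.
clean, t_clean = timed(r'(?:<div>(?:\s|&nbsp;)*){6}', sample)

print(overlap, clean)        # False False -- identical verdicts
print(t_clean < t_overlap)   # True -- the overlap variant burns far more time
```

A side note on the `{,100}` fix quoted above: Perl versions before 5.34 do
not recognize `{,100}` (no lower bound) as a quantifier and match it as
literal text, so `{0,100}` is the safer, portable spelling.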


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Large (usually legitimate) HTML mails choking SA

Posted by Kris Deugau <kd...@vianet.ca>.
Karsten Bräckelmann wrote:
> On Fri, 2011-05-27 at 10:38 -0400, Kris Deugau wrote:
>> Mmmm.  I don't *think* so, but testing the message on a stock SA 3.3.1
>> took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).
>
> The latter being the production system with the custom rules, or at
> least having an identical set of custom rules?

Yeah;  I create the rules on my desktop (usually with an example spam on 
hand to make sure the rule hits what I intended it to hit), commit to 
svn, and periodically merge changes to a branch that's autopublished in 
something resembling the same way as the official stock rules and JM's 
SOUGHT rules.

> Yes, that sounds like the culprit indeed is one or more custom rule. If
> that "much faster" equals twice as fast,

Probably closer to 4-6x;  dual PIII/866 -> Core i3 3GHz.

> Bisection is your friend.
>
> Go hunt down that bugger, that in conjunction with the specific sample
> kills your performance. Once you found it, maybe you can post it?

Seems to have been this:

rawbody TOO_MANY_DIVS	/(?:<[Dd][Ii][Vv]>(?:\s|\n|\&nbsp\;)*){6}/
describe TOO_MANY_DIVS	6 or more <div> tags in a row
score TOO_MANY_DIVS	0.75

Changing the * to {,100} drops the processing time down to ~8s.

I've got a number of similar rules for other "many logical/physical 
linebreaks with no content".  I don't have a specific spample to point 
to just now, but from memory the original targets really did have a 
widely varying number of linebreaks or whitespace (logical or otherwise) 
in between the HTML tags, and I've been bitten before with applying 
bounds to matches (related rules for garbage HTML comments) not being 
*large* enough.  O_o

This particular message has page after page of:

=09=09=09
=09=09=09
=09=09=09
=09
=09
=09

etc, with a few <div> or <font> tags for excitement.
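(Those =09 runs are quoted-printable encoded tabs, so decoded, the "content"
is page after page of pure whitespace:)

```python
import quopri

# "=09" is the quoted-printable encoding of a TAB (0x09), so each of those
# lines decodes to nothing but tab characters.
decoded = quopri.decodestring(b'=09=09=09\n')
print(decoded)   # b'\t\t\t\n'
```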

-kgd

Re: Large (usually legitimate) HTML mails choking SA

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2011-05-27 at 10:38 -0400, Kris Deugau wrote:
> Karsten Bräckelmann wrote:
> > > However, we've just had a couple of *legitimate* messages get stuck for
> > > essentially the same reason - a whole lot of pathologically bad HTML.
> >
> > Rings a bell. Such reports usually turned out to be caused by custom
> > rules. Any custom rawbody rules, in particular ones matching HTML tags,
> 
> Yes, a few.
> 
> > or otherwise prone to trigger RE backtracking? (That is, may consume
> > large sub-strings, before a following sub-pattern.)
> 
> Mmmm.  I don't *think* so, but testing the message on a stock SA 3.3.1 
> took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).

The latter being the production system with the custom rules, or at
least having an identical set of custom rules?

Yes, that sounds like the culprit indeed is one or more custom rule. If
that "much faster" equals twice as fast, your custom rules are taking
25(!) times as long as the complete stock rule-set, including all the
parsing and stuff.

Bisection is your friend.

Go hunt down that bugger, that in conjunction with the specific sample
kills your performance. Once you found it, maybe you can post it?


> I have a couple of instances of [a-z]+ and similar;  is that effectively 
> as troublesome as .+ or .*?

That on its own (i.e. not nested inside an alternation, etc) is very
unlikely to be the issue, since it appears to be triggered by the HTML
in the message.




Re: Large (usually legitimate) HTML mails choking SA

Posted by Kris Deugau <kd...@vianet.ca>.
Karsten Bräckelmann wrote:
> On Thu, 2011-05-26 at 15:02 -0400, Kris Deugau wrote:
>> Every so often we get a message or two stuck in our inbound mail queue
>> because it took too long for SA to process during mail delivery.
>
>> However, we've just had a couple of *legitimate* messages get stuck for
>> essentially the same reason - a whole lot of pathologically bad HTML.
>
> Rings a bell. Such reports usually turned out to be caused by custom
> rules. Any custom rawbody rules, in particular ones matching HTML tags,

Yes, a few.

> or otherwise prone to trigger RE backtracking? (That is, may consume
> large sub-strings, before a following sub-pattern.)

Mmmm.  I don't *think* so, but testing the message on a stock SA 3.3.1 
took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).

I have a couple of instances of [a-z]+ and similar;  is that effectively 
as troublesome as .+ or .*?

...  Hm.  I also notice I have more custom local rules than there are 
stock rules.  I *really* need to get some testing infrastructure in 
place to trim that list down.  O_o

-kgd

Re: Large (usually legitimate) HTML mails choking SA

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2011-05-26 at 15:02 -0400, Kris Deugau wrote:
> Every so often we get a message or two stuck in our inbound mail queue 
> because it took too long for SA to process during mail delivery.

> However, we've just had a couple of *legitimate* messages get stuck for 
> essentially the same reason - a whole lot of pathologically bad HTML. 

Rings a bell. Such reports usually turned out to be caused by custom
rules. Any custom rawbody rules, in particular ones matching HTML tags,
or otherwise prone to trigger RE backtracking? (That is, may consume
large sub-strings, before a following sub-pattern.)


> Has anyone got any suggestions for decreasing the load SA imposes trying 
> to process one of these?

Identify the bad boy. :)




Re: Large (usually legitimate) HTML mails choking SA

Posted by Kris Deugau <kd...@vianet.ca>.
John Hardin wrote:
> On Thu, 26 May 2011, Kris Deugau wrote:
>
>> Whitelisting these once they're found lets them bypass SA altogether,
>> but in the meantime they get stuck in the mail queue.
>>
>> Has anyone got any suggestions for decreasing the load SA imposes
>> trying to process one of these?
>
> Any possibility of getting a sample?

Eugh, that was *nasty*.

Thoroughly anonymized version at 
http://www.deepnet.cx/~kdeugau/spamtools/nastyhtml.eml.

And the HTML is really, truly, *nasty*.  I've never seen such a 
spectacular mess that's still legal HTML, even from Word or Frontpage.

And of course, because it's so nasty, I had to hand-edit it to anonymize 
it because otherwise any HTML editor would have cleaned it up....   >_<

-kgd

Re: Large (usually legitimate) HTML mails choking SA

Posted by John Hardin <jh...@impsec.org>.
On Thu, 26 May 2011, Kris Deugau wrote:

> Whitelisting these once they're found lets them bypass SA altogether, 
> but in the meantime they get stuck in the mail queue.
>
> Has anyone got any suggestions for decreasing the load SA imposes trying 
> to process one of these?

Any possibility of getting a sample?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Think Microsoft cares about your needs at all?
   "A company wanted to hold off on upgrading Microsoft Office for a
   year in order to do other projects. So Microsoft gave a 'free' copy
   of the new Office to the CEO -- a copy that of course generated
   errors for anyone else in the firm reading his documents. The CEO
   got tired of getting the 'please re-send in XX format' so he
   ordered other projects put on hold and the Office upgrade to be top
   priority."                                    -- Cringely, 4/8/2004
-----------------------------------------------------------------------
  4 days until Memorial Day - honor those who sacrificed for our liberty

Re: Large (usually legitimate) HTML mails choking SA

Posted by RW <rw...@googlemail.com>.
On Thu, 26 May 2011 15:02:37 -0400
Kris Deugau <kd...@vianet.ca> wrote:

> Every so often we get a message or two stuck in our inbound mail
> queue because it took too long for SA to process during mail delivery.
> 
> For a little while there were actually runs of pure HTML-garbage spam 
> over 500K;  those have been dealt with (and may have disappeared).
>...
> Has anyone got any suggestions for decreasing the load SA imposes
> trying to process one of these?

How about short-circuiting on BAYES_00?
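For reference, the setup RW is suggesting looks roughly like this (it
requires the Shortcircuit plugin to be loaded; the priority value is
illustrative):

```
# init.pre (usually shipped commented-out):
loadplugin Mail::SpamAssassin::Plugin::Shortcircuit

# local.cf: stop scanning as soon as BAYES_00 fires, and run it early.
shortcircuit BAYES_00  ham
priority     BAYES_00  -500
```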