Posted to users@spamassassin.apache.org by Kris Deugau <kd...@vianet.ca> on 2011/05/26 21:02:37 UTC
Large (usually legitimate) HTML mails choking SA
Every so often we get a message or two stuck in our inbound mail queue
because it took too long for SA to process during mail delivery.
For a little while there were actually runs of pure HTML-garbage spam
over 500K; those have been dealt with (and may have disappeared).
However, we've just had a couple of *legitimate* messages get stuck for
essentially the same reason - a whole lot of pathologically bad HTML.
(One looks like it was generated by Word, then converted to email by an
MS mail library used by a third-party SMTP mailer. There's maybe 5K
worth of actual useful HTML in about 100K of QP-encoded HTML. The
others appear to be investment-related, where the sender has included
the content as a base64-encoded .html attachment.)
Whitelisting these once they're found lets them bypass SA altogether,
but in the meantime they get stuck in the mail queue.
Has anyone got any suggestions for decreasing the load SA imposes trying
to process one of these?
-kgd
Re: Large (usually legitimate) HTML mails choking SA
Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2011-05-26 at 16:37 -0400, Alex wrote:
> Any tips on how to do that? When used in conjunction with amavis, is
> there a way to identify which rule consumes the most processing time,
> in the same way it can for bayes or SA overall?
>
By inspection, e.g. any rawbody rule whose regex contains .* is an
immediate suspect.
These days I tend not to write any rule with an unbounded match. As an
example, instead of "string1.*(string2|string3)" I'll write that part of
the rule as "string1.{0,n}(string2|string3)", because, unlike the
unbounded .*, this cannot match a huge span of text and then have to
backtrack miles before trying subsequent alternates.
Martin
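Martin's rewrite is easy to sketch outside SA. A minimal Python timing
harness (Python's backtracking engine behaves like Perl's here; the
keywords, the bound of 40, and the sample text are all made up for
illustration, not real rules):

```python
import re
import time

# Unbounded ".*" vs. the bounded "{0,n}" rewrite. On a body where the
# first keyword appears but the second never does, the unbounded
# pattern must scan and backtrack over the whole remaining text at
# every match attempt; the bounded one gives up after 40 characters.
unbounded = re.compile(r'offer.*(click|buy)')
bounded = re.compile(r'offer.{0,40}(click|buy)')

text = ('offer ' + 'x' * 2000) * 50  # 50 false starts, no second keyword

for pat in (unbounded, bounded):
    t0 = time.perf_counter()
    result = pat.search(text)
    print(pat.pattern, result, '%.4fs' % (time.perf_counter() - t0))
```

Neither pattern matches, but the unbounded one does vastly more work
per false start, and that gap grows with message size.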
Re: Large (usually legitimate) HTML mails choking SA
Posted by Alex <my...@gmail.com>.
Hi,
>> Every so often we get a message or two stuck in our inbound mail queue
>> because it took too long for SA to process during mail delivery.
>
>> However, we've just had a couple of *legitimate* messages get stuck for
>> essentially the same reason - a whole lot of pathologically bad HTML.
>
> Rings a bell. Such reports usually turned out to be caused by custom
> rules. Any custom rawbody rules, in particular ones matching HTML tags,
> or otherwise prone to trigger RE backtracking? (That is, may consume
> large sub-strings, before a following sub-pattern.)
>
>> Has anyone got any suggestions for decreasing the load SA imposes trying
>> to process one of these?
>
> Identify the bad boy. :)
Any tips on how to do that? When used in conjunction with amavis, is
there a way to identify which rule consumes the most processing time,
in the same way it can for bayes or SA overall?
Thanks,
Alex
Re: Large (usually legitimate) HTML mails choking SA
Posted by da...@chaosreigns.com.
On 05/27, John Hardin wrote:
> Yes. "*" is "zero or more, unbounded" and "+" is "one or more, unbounded".
>
> It's much better to have an upper limit in body and rawbody rules,
> e.g. {0,80} or {1,80}
>
> The upper limit may need some experimentation to set in specific
> cases, but even so, {0,255} can be much less painful than *.
So somebody should (open a bug to) go through all the rules we provide
and replace all instances of "*" with {0,255} and "+" with {1,255}?
> Header and URI texts are inherently fairly short so it's safer to
> use unbounded matches against them, but even so it's good idea to
But still vulnerable to regex DoS....
--
"I don't want to die... just yet... not while there's... women."
- J. Matthew Root, 8/23/02 (http://www.jmrart.com/)
http://www.ChaosReigns.com
Re: Large (usually legitimate) HTML mails choking SA
Posted by John Hardin <jh...@impsec.org>.
On Fri, 27 May 2011, Kris Deugau wrote:
> I have a couple of instances of [a-z]+ and similar; is that effectively as
> troublesome as .+ or .*?
Yes. "*" is "zero or more, unbounded" and "+" is "one or more, unbounded".
It's much better to have an upper limit in body and rawbody rules, e.g.
{0,80} or {1,80}
The upper limit may need some experimentation to set in specific cases,
but even so, {0,255} can be much less painful than *.
Header and URI texts are inherently fairly short so it's safer to use
unbounded matches against them, but even so it's a good idea to simply
get in the habit of always using bounded matches when writing rules.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
How can you reason with someone who thinks we're on a glidepath to
a police state and yet their solution is to grant the government a
monopoly on force? They are insane.
-----------------------------------------------------------------------
3 days until Memorial Day - honor those who sacrificed for our liberty
Re: Large (usually legitimate) HTML mails choking SA
Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Fri, 27 May 2011 10:38:17 -0400
Kris Deugau <kd...@vianet.ca> wrote:
> I have a couple of instances of [a-z]+ and similar; is that
> effectively as troublesome as .+ or .*?
It could be, depending on what else is in the regex. There's a fairly
nice Wikipedia article about evil regexes:
http://en.wikipedia.org/wiki/ReDoS#Evil_regexes
When I write SA rules, I never use the * or + operators. I always
use something like {0,40} or {1,40} just to be on the safe side.
(That still does not eliminate the possibility of exponential behaviour
from bad regexes, but it does offer some protection against bad behaviour
from unfortunate strings to be matched.)
Regards,
David.
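The classic "evil regex" from that article can be reproduced in a few
lines (a Python sketch; its backtracking engine behaves like Perl's,
and the input sizes are just illustrative):

```python
import re
import time

# Textbook evil regex: a nested quantifier whose inner and outer
# repetitions overlap. On an almost-matching input the engine tries
# every way of splitting the run of "a"s between the two quantifiers,
# so the work roughly doubles with each extra character.
evil = re.compile(r'^(a+)+$')

for n in (14, 18, 22):
    s = 'a' * n + '!'  # the trailing "!" forces a full backtracking search
    t0 = time.perf_counter()
    assert evil.match(s) is None
    print(n, '%.3fs' % (time.perf_counter() - t0))
```

Bounding the quantifiers limits the damage from a single repetition,
but as David notes, it does not remove the exponential blowup itself.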
Re: Large (usually legitimate) HTML mails choking SA
Posted by Kris Deugau <kd...@vianet.ca>.
Karsten Bräckelmann wrote:
> However, using (?:\s|\ )* also does the trick. Yes, keeping the
> nasty asterisk quantifier. The difference is merely dropping the \n from
> the alternation, which is part of \s whitespace anyway.
>
> Wondering if this is a case where Perl fails to optimize out the \n.
> Which would result in an alternation with overlap...
Hmm. This may be a Perl-version-specific (or
which-flags-Perl-was-built-with) thing, then, because I've been adding
\n to rawbody rules where I want to match multiple physical lines,
since \s *hasn't* been matching newlines - at least, not all the time.
-kgd
Re: Large (usually legitimate) HTML mails choking SA
Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2011-05-27 at 13:14 -0400, Kris Deugau wrote:
> Karsten Bräckelmann wrote:
> > Yes, that sounds like the culprit indeed is one or more custom rule. If
> > that "much faster" equals twice as fast,
>
> Probably closer to 4-6x; dual PIII/866 -> Core i3 3GHz.
Sure -- that "twice" assumption was just a quickly assumed lower bound,
that still shows the dramatic difference of the custom rule burning a
whopping 25 times the CPU.
> > Bisection is your friend.
> >
> > Go hunt down that bugger, that in conjunction with the specific sample
> > kills your performance. Once you found it, maybe you can post it?
>
> Seems to have been this:
>
> rawbody TOO_MANY_DIVS /(?:<[Dd][Ii][Vv]>(?:\s|\n|\ \;)*){6}/
Aha! Yes, that nesting of quantifiers sure looks like a prime candidate.
Even though this isn't the pure evil form -- which would be to have two
alternatives with overlap in sub-patterns.
Or maybe it is. Frankly, not sure what exactly causes the RE to go
berserk.
> Changing the * to {,100} drops the processing time down to ~8s.
Confirmed, grabbed your sample and this eliminates the issue.
However, using (?:\s|\ )* also does the trick. Yes, keeping the
nasty asterisk quantifier. The difference is merely dropping the \n from
the alternation, which is part of \s whitespace anyway.
Wondering if this is a case where Perl fails to optimize out the \n.
Which would result in an alternation with overlap...
--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
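Karsten's overlap theory can be checked outside SA. A scaled-down
Python sketch (the engine backtracks like Perl's; the repeat count of 2
and the short run of newlines are chosen so the demo finishes quickly):

```python
import re
import time

# "\n" is already part of "\s", so (?:\s|\n)* is an alternation with
# overlap: every newline can be consumed by either branch, and on a
# failed overall match the engine revisits every combination of those
# choices -- roughly 2**n paths for a run of n newlines.
overlap = re.compile(r'(?:<div>(?:\s|\n)*){2}')
clean = re.compile(r'(?:<div>\s*){2}')

# One <div> followed by newlines; the required second <div> never
# arrives, so the overlap pattern must exhaust its choice tree.
text = '<div>' + '\n' * 18 + 'x'

for pat in (overlap, clean):
    t0 = time.perf_counter()
    assert pat.search(text) is None
    print(pat.pattern, '%.3fs' % (time.perf_counter() - t0))
```

Dropping the redundant \n (or bounding the star) removes the overlap,
which matches what Karsten observed with the TOO_MANY_DIVS rule.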
Re: Large (usually legitimate) HTML mails choking SA
Posted by Kris Deugau <kd...@vianet.ca>.
Karsten Bräckelmann wrote:
> On Fri, 2011-05-27 at 10:38 -0400, Kris Deugau wrote:
>> Mmmm. I don't *think* so, but testing the message on a stock SA 3.3.1
>> took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).
>
> The latter being the production system with the custom rules, or at
> least having an identical set of custom rules?
Yeah; I create the rules on my desktop (usually with an example spam on
hand to make sure the rule hits what I intended it to hit), commit to
svn, and periodically merge changes to a branch that's autopublished in
something resembling the same way as the official stock rules and JM's
SOUGHT rules.
> Yes, that sounds like the culprit indeed is one or more custom rule. If
> that "much faster" equals twice as fast,
Probably closer to 4-6x; dual PIII/866 -> Core i3 3GHz.
> Bisection is your friend.
>
> Go hunt down that bugger, that in conjunction with the specific sample
> kills your performance. Once you found it, maybe you can post it?
Seems to have been this:
rawbody TOO_MANY_DIVS /(?:<[Dd][Ii][Vv]>(?:\s|\n|\ \;)*){6}/
describe TOO_MANY_DIVS 6 or more <div> tags in a row
score TOO_MANY_DIVS 0.75
Changing the * to {,100} drops the processing time down to ~8s.
I've got a number of similar rules for other cases of "many
logical/physical linebreaks with no content". I don't have a specific
spample to point to just now, but from memory the original targets
really did have a widely varying number of linebreaks or whitespace
(logical or otherwise) in between the HTML tags, and I've been bitten
before by bounds on matches (in related rules for garbage HTML
comments) not being *large* enough. O_o
This particular message has page after page of:
=09=09=09
=09=09=09
=09=09=09
=09
=09
=09
etc, with a few <div> or <font> tags for excitement.
-kgd
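For reference, those "=09" runs are quoted-printable escapes for tab
characters; decoding a fragment with Python's stdlib shows why the
rendered body is page after page of pure whitespace:

```python
import quopri

# "=09" is the quoted-printable encoding of an ASCII tab (0x09), so
# each of those lines decodes to a run of tabs plus a newline.
fragment = b'=09=09=09\n=09=09=09\n=09\n'
decoded = quopri.decodestring(fragment)
print(repr(decoded))  # b'\t\t\t\n\t\t\t\n\t\n'
```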
Re: Large (usually legitimate) HTML mails choking SA
Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2011-05-27 at 10:38 -0400, Kris Deugau wrote:
> Karsten Bräckelmann wrote:
> > > However, we've just had a couple of *legitimate* messages get stuck for
> > > essentially the same reason - a whole lot of pathologically bad HTML.
> >
> > Rings a bell. Such reports usually turned out to be caused by custom
> > rules. Any custom rawbody rules, in particular ones matching HTML tags,
>
> Yes, a few.
>
> > or otherwise prone to trigger RE backtracking? (That is, may consume
> > large sub-strings, before a following sub-pattern.)
>
> Mmmm. I don't *think* so, but testing the message on a stock SA 3.3.1
> took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).
The latter being the production system with the custom rules, or at
least having an identical set of custom rules?
Yes, that sounds like the culprit indeed is one or more custom rule. If
that "much faster" equals twice as fast, your custom rules are taking
25(!) times as long as the complete stock rule-set, including all the
parsing and stuff.
Bisection is your friend.
Go hunt down that bugger, that in conjunction with the specific sample
kills your performance. Once you found it, maybe you can post it?
> I have a couple of instances of [a-z]+ and similar; is that effectively
> as troublesome as .+ or .*?
That on its own (i.e. not nested inside an alternation, etc) is very
unlikely to be the issue, since it appears to be triggered by the HTML
in the message.
Re: Large (usually legitimate) HTML mails choking SA
Posted by Kris Deugau <kd...@vianet.ca>.
Karsten Bräckelmann wrote:
> On Thu, 2011-05-26 at 15:02 -0400, Kris Deugau wrote:
>> Every so often we get a message or two stuck in our inbound mail queue
>> because it took too long for SA to process during mail delivery.
>
>> However, we've just had a couple of *legitimate* messages get stuck for
>> essentially the same reason - a whole lot of pathologically bad HTML.
>
> Rings a bell. Such reports usually turned out to be caused by custom
> rules. Any custom rawbody rules, in particular ones matching HTML tags,
Yes, a few.
> or otherwise prone to trigger RE backtracking? (That is, may consume
> large sub-strings, before a following sub-pattern.)
Mmmm. I don't *think* so, but testing the message on a stock SA 3.3.1
took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).
I have a couple of instances of [a-z]+ and similar; is that effectively
as troublesome as .+ or .*?
... Hm. I also notice I have more custom local rules than there are
stock rules. I *really* need to get some testing infrastructure in
place to trim that list down. O_o
-kgd
Re: Large (usually legitimate) HTML mails choking SA
Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2011-05-26 at 15:02 -0400, Kris Deugau wrote:
> Every so often we get a message or two stuck in our inbound mail queue
> because it took too long for SA to process during mail delivery.
> However, we've just had a couple of *legitimate* messages get stuck for
> essentially the same reason - a whole lot of pathologically bad HTML.
Rings a bell. Such reports usually turned out to be caused by custom
rules. Any custom rawbody rules, in particular ones matching HTML tags,
or otherwise prone to trigger RE backtracking? (That is, may consume
large sub-strings, before a following sub-pattern.)
> Has anyone got any suggestions for decreasing the load SA imposes trying
> to process one of these?
Identify the bad boy. :)
Re: Large (usually legitimate) HTML mails choking SA
Posted by Kris Deugau <kd...@vianet.ca>.
John Hardin wrote:
> On Thu, 26 May 2011, Kris Deugau wrote:
>
>> Whitelisting these once they're found lets them bypass SA altogether,
>> but in the meantime they get stuck in the mail queue.
>>
>> Has anyone got any suggestions for decreasing the load SA imposes
>> trying to process one of these?
>
> Any possibility of getting a sample?
Eugh, that was *nasty*.
Thoroughly anonymized version at
http://www.deepnet.cx/~kdeugau/spamtools/nastyhtml.eml.
And the HTML is really, truly, *nasty*. I've never seen such a
spectacular mess that's still legal HTML, even from Word or Frontpage.
And of course, because it's so nasty, I had to hand-edit it to anonymize
it because otherwise any HTML editor would have cleaned it up.... >_<
-kgd
Re: Large (usually legitimate) HTML mails choking SA
Posted by John Hardin <jh...@impsec.org>.
On Thu, 26 May 2011, Kris Deugau wrote:
> Whitelisting these once they're found lets them bypass SA altogether,
> but in the meantime they get stuck in the mail queue.
>
> Has anyone got any suggestions for decreasing the load SA imposes trying
> to process one of these?
Any possibility of getting a sample?
Re: Large (usually legitimate) HTML mails choking SA
Posted by RW <rw...@googlemail.com>.
On Thu, 26 May 2011 15:02:37 -0400
Kris Deugau <kd...@vianet.ca> wrote:
> Every so often we get a message or two stuck in our inbound mail
> queue because it took too long for SA to process during mail delivery.
>
> For a little while there were actually runs of pure HTML-garbage spam
> over 500K; those have been dealt with (and may have disappeared).
>...
> Has anyone got any suggestions for decreasing the load SA imposes
> trying to process one of these?
How about short-circuiting on BAYES_00?
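RW's suggestion maps onto the stock Shortcircuit plugin. A hedged
local.cf sketch (the plugin ships with SA 3.2+ and is normally enabled
via v320.pre; the priority value here is illustrative):

```
# Requires the Shortcircuit plugin (uncomment in v320.pre if needed):
# loadplugin Mail::SpamAssassin::Plugin::Shortcircuit

# Run the Bayes "definitely ham" test early, and stop scanning as soon
# as it fires -- large pathological-HTML messages that Bayes already
# recognizes as ham then skip the expensive body/rawbody rules.
priority     BAYES_00  -500
shortcircuit BAYES_00  ham
```

The tradeoff is that a short-circuited message gets no score from any
later rule, so this only helps when BAYES_00 is trustworthy on its own.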