Posted to dev@spamassassin.apache.org by Kevin Golding <ke...@caomhin.demon.co.uk> on 2010/02/16 17:56:43 UTC

Slow masscheck

It seems very few people have uploaded masscheck results the past couple
of days.  Given that I just noticed my usual ~15 minute run from
yesterday was still chugging along ~30 hours later (and this morning's
has been running for ~6 hours) I'm wondering if the lack of results from
others is for the same reason?

Or in other words, did something slow the masscheck code considerably?

Kevin

Re: Slow masscheck

Posted by John Hardin <jh...@impsec.org>.
On Thu, 18 Feb 2010, Kevin Golding wrote:

> Just as a follow-up it was down to ~35 minutes today so it looks like
> it's no longer scary.

Yeah, mea culpa - sorry. I got out of the habit of running a full local 
masscheck before committing. I'll be good, I promise. :)

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Windows and its users got mentioned at home today, after my wife the
  psych major brought up Seligman's theory of "learned helplessness."
                                              -- Dan Birchall in a.s.r
-----------------------------------------------------------------------
  4 days until George Washington's 278th Birthday

Re: Slow masscheck

Posted by Kevin Golding <ke...@caomhin.demon.co.uk>.
Just as a follow-up it was down to ~35 minutes today so it looks like
it's no longer scary.

Kevin

Re: Slow masscheck

Posted by Justin Mason <jm...@jmason.org>.
2010/2/17 John Hardin <jh...@impsec.org>:
> On Wed, 17 Feb 2010, Karsten Bräckelmann wrote:
>
>>>> I'm seeing the same explosion in run times.
>>>
>>> Looking at commits between r909296 and r910179, I think the issue may be
>>> with a new rule in jhardin/20_misc_testing.cf from r910157:
>>>
>>> +rawbody STYLE_GIBBERISH /<style[^>]{0,30}>(?:\s{1,20}|[^\s:;<]){175}/im
>>
>> Heh, I was just looking at the very same and bookmarked it for review
>> tomorrow morning.
>>
>> However, it shouldn't be that bad -- since it is bound, and the
>> alternatives are spaces or non-spaces. That should not lead to massive
>> backtracking.
>
> I was wondering about that one, too. I'll take it back out. I'm thinking of
> a better way to achieve that.

+1 please do.  I'd say that's the issue.

-- 
--j.

Re: Slow masscheck

Posted by John Hardin <jh...@impsec.org>.
On Wed, 17 Feb 2010, Karsten Bräckelmann wrote:

> On Wed, 2010-02-17 at 10:55 -0800, John Hardin wrote:
>> That's where I started this, and then I ran into a situation where I
>> wanted to collapse whitespace, at which point things went south.
>
> Hmm... Collapsing whitespace. Going back to the original RE, and what it
> probably was meant to be. "A minimum of 175 chars other than [:;] before
> the tag closing, with any whitespace ignored."
>
>  /(?:\s*[^\s:;<]){175}/
>
> What about that, then? :)  Unlike the original, this one simply swallows
> any whitespace and operates on a minimum number of non-whitespace, non-
> style definition syntax (very basic set). Whereas the original RE
> happily would have accepted 1*175, up until 20*175 chars of pure
> whitespace. Hardly gibberish, but just a waste of bandwidth. ;)

I tried something like that and it didn't perform well, but trying it 
again it seems to work. I wonder what I messed up the first time I tried 
it?

Testing...


Re: Slow masscheck

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2010-02-17 at 10:55 -0800, John Hardin wrote:
> That's where I started this, and then I ran into a situation where I 
> wanted to collapse whitespace, at which point things went south.

Hmm... Collapsing whitespace. Going back to the original RE, and what it
probably was meant to be. "A minimum of 175 chars other than [:;] before
the tag closing, with any whitespace ignored."

  /(?:\s*[^\s:;<]){175}/

What about that, then? :)  Unlike the original, this one simply swallows
any whitespace and operates on a minimum number of non-whitespace, non-
style definition syntax (very basic set). Whereas the original RE
happily would have accepted 1*175, up until 20*175 chars of pure
whitespace. Hardly gibberish, but just a waste of bandwidth. ;)
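
As a rough illustration of the difference, here is a minimal Python sketch. It uses Python's re module rather than the Perl engine SpamAssassin runs on, so behaviour and timing may differ. The `<style[^>]{0,30}>` prefix on the suggested fragment is carried over from the original rule as an assumption about how the two pieces would be combined, and the sample inputs are made up.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE  # rough equivalent of /im

# Original rule from r910157, as quoted in this thread.
ORIGINAL = re.compile(r"<style[^>]{0,30}>(?:\s{1,20}|[^\s:;<]){175}", FLAGS)

# Suggested fragment with the <style> prefix added (an assumption).
SUGGESTED = re.compile(r"<style[^>]{0,30}>(?:\s*[^\s:;<]){175}", FLAGS)

# 200 non-whitespace characters with whitespace sprinkled in between.
gibberish = "<style>" + "ab \n" * 100

# Pure whitespace, exactly 20*175 characters, so the first greedy attempt of
# the original succeeds. Other run lengths can push a backtracking engine
# through a very large number of attempts and are deliberately not tried.
padding = "<style>" + " " * (20 * 175)

# Ordinary CSS reaches a ':' long before 175 characters.
normal_css = "<style>" + " " * 500 + "p{color:red}"

suggested_hits_gibberish = SUGGESTED.search(gibberish) is not None
suggested_hits_padding = SUGGESTED.search(padding) is not None
original_hits_padding = ORIGINAL.search(padding) is not None
suggested_hits_css = SUGGESTED.search(normal_css) is not None

print(suggested_hits_gibberish, suggested_hits_padding,
      original_hits_padding, suggested_hits_css)
```

In this sketch the suggested form hits the gibberish sample and ignores both the whitespace padding and the ordinary CSS, while the original form hits the pure-whitespace padding.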


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Slow masscheck

Posted by John Hardin <jh...@impsec.org>.
On Wed, 17 Feb 2010, Karsten Bräckelmann wrote:

> On Wed, 2010-02-17 at 08:19 -0800, John Hardin wrote:
>
>> I don't think there's any really _good_ way to do what I'm trying to do in
>> a rawbody rule. I'm now thinking a plugin that pulls out specified HTML 
>> tags and their contents and allows rules on them is the best way to 
>> approach this, for example:
>>
>>    tagbody  STYLE_GIBBERISH  style =~ /^[^:;]{200}/
>
> That actually looks useful and quite elegant, but it should be easy to
> convert into a simple rawbody rule, no? Without any substantial amount
> of thinking about it:
>
>  rawbody STYLE_GIBBERISH  /<style[^>]*>[^:;<]{200}/im

For that simple RE, yes, but with more complex REs on tag contents a rawbody
rule probably gets ugly fast.

That's where I started this, and then I ran into a situation where I 
wanted to collapse whitespace, at which point things went south.


Re: Slow masscheck

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2010-02-17 at 08:19 -0800, John Hardin wrote:
> On Wed, 17 Feb 2010, Karsten Bräckelmann wrote:
> 
> >>>> rawbody STYLE_GIBBERISH /<style[^>]{0,30}>(?:\s{1,20}|[^\s:;<]){175}/im
> 
> > The problem is nested quantifiers with an alternation.
> >
> > An alternative approach that should match the desired would look like
> > this -- eliminating the alternation with quantifiers inside.

> > John, does the above example help? :)
> 
> Not enough. What if there are more than 20 spaces? Or no spaces in a block 
> of more than 80 non-punctuation characters?

It was merely an example showing a different approach to a similar RE.
You're free to adjust the numbers -- I pretty much pulled them out of
the air anyway.

> I don't think there's any really _good_ way to do what I'm trying to do in a
> rawbody rule. I'm now thinking a plugin that pulls out specified HTML tags 
> and their contents and allows rules on them is the best way to approach 
> this, for example:
> 
>    tagbody  STYLE_GIBBERISH  style =~ /^[^:;]{200}/

That actually looks useful and quite elegant, but it should be easy to
convert into a simple rawbody rule, no? Without any substantial amount
of thinking about it:

  rawbody STYLE_GIBBERISH  /<style[^>]*>[^:;<]{200}/im

Of course, it isn't limited to text/html parts.
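
For anyone who wants to poke at that expression outside SpamAssassin, here is a minimal Python sketch of the same pattern. It only mirrors the regex; how SpamAssassin feeds rawbody text to a rule is not modelled. The helper name `style_gibberish_hits` and the sample strings are made up for illustration.

```python
import re

# Python rendering of the rawbody rule above; /im maps to IGNORECASE | MULTILINE.
STYLE_GIBBERISH = re.compile(r"<style[^>]*>[^:;<]{200}", re.IGNORECASE | re.MULTILINE)


def style_gibberish_hits(raw_body: str) -> bool:
    """Return True if a <style> tag is followed by 200 chars without ':', ';' or '<'."""
    return STYLE_GIBBERISH.search(raw_body) is not None


# 240 characters of filler inside an upper-case STYLE tag.
junk = '<STYLE type="text/css">' + "lorem ipsum " * 20 + "</STYLE>"

# A real declaration reaches ':' almost immediately; text after the tag is ignored.
real = "<style>body { color: red; }</style>" + "x" * 300

print(style_gibberish_hits(junk), style_gibberish_hits(real))
```

The function takes any string, which matches the caveat above: nothing in the pattern itself knows whether the text came from a text/html part.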




Re: Slow masscheck

Posted by John Hardin <jh...@impsec.org>.
On Wed, 17 Feb 2010, Karsten Bräckelmann wrote:

>>>> rawbody STYLE_GIBBERISH /<style[^>]{0,30}>(?:\s{1,20}|[^\s:;<]){175}/im

> The problem is nested quantifiers with an alternation.
>
> An alternative approach that should match the desired would look like
> this -- eliminating the alternation with quantifiers inside.
>
>  / (?: \s{1,20} [^\s:;<]{1,80} ){80} /x    # spaces for readability
>
> Since there is no alternation and the two char classes are distinct, 
> this RE can be simply expanded and matched from left to right, without 
> any ambiguity.
>
> John, does the above example help? :)

Not enough. What if there are more than 20 spaces? Or no spaces in a block 
of more than 80 non-punctuation characters?

I don't think there's any really _good_ way to do what I'm trying to do in a
rawbody rule. I'm now thinking a plugin that pulls out specified HTML tags 
and their contents and allows rules on them is the best way to approach 
this, for example:

   tagbody  STYLE_GIBBERISH  style =~ /^[^:;]{200}/

This would be more generally useful, too.
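
To make the idea concrete, here is a minimal sketch in Python of what such a tag-content rule could do. It is not SpamAssassin code and `tagbody` is only a proposed rule type; the class `TagBodyExtractor`, the function `tagbody_hits` and the sample inputs are all hypothetical names chosen for this illustration. It relies on Python's standard html.parser, which hands the contents of a style element to `handle_data` as plain text.

```python
import re
from html.parser import HTMLParser


class TagBodyExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag.lower()
        self.depth = 0      # > 0 while inside the wanted tag
        self.chunks = []    # text pieces of the tag currently open
        self.bodies = []    # finished tag contents, one string per tag

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            if self.depth == 0:
                self.chunks = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.wanted_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.bodies.append("".join(self.chunks))

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def tagbody_hits(html, tag, pattern):
    """Apply `pattern` to the contents of each `tag` element, never to other text."""
    parser = TagBodyExtractor(tag)
    parser.feed(html)
    parser.close()
    return any(pattern.search(body) for body in parser.bodies)


# Same expression as the proposed rule; '^' now means "start of the tag contents".
STYLE_GIBBERISH = re.compile(r"^[^:;]{200}")

spammy = "<html><head><style>" + "q" * 250 + "</style></head></html>"
normal = "<style>p { color: red; }</style><p>" + "q" * 250 + "</p>"

print(tagbody_hits(spammy, "style", STYLE_GIBBERISH),
      tagbody_hits(normal, "style", STYLE_GIBBERISH))
```

Because the expression only ever sees the extracted contents, it can be anchored with `^` and needs no `<style[^>]*>` prefix, which is what keeps the more complex cases from getting ugly.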


Re: Slow masscheck

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2010-02-17 at 06:36 -0800, John Hardin wrote:
> On Wed, 17 Feb 2010, Karsten Bräckelmann wrote:

> > > +rawbody STYLE_GIBBERISH /<style[^>]{0,30}>(?:\s{1,20}|[^\s:;<]){175}/im
> >
> > Heh, I was just looking at the very same and bookmarked it for review
> > tomorrow morning.
> >
> > However, it shouldn't be that bad -- since it is bound, and the
> > alternatives are spaces or non-spaces. That should not lead to massive
> > backtracking.

Argh! On second thought, something I overlooked yesterday night. That RE
does have *massive* problems with some pathological cases of lots of
spaces. To fit exactly 175 occurrences, it might be necessary e.g. to
split an initial greedy 20 white-spaces match into multiple consecutive
matches of <20 spaces.

The problem is nested quantifiers with an alternation.
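
To get a feel for the size of that ambiguity, the sketch below counts in how many ways a single run of whitespace can be cut into consecutive \s{1,20} chunks. This is plain arithmetic in Python, not a measurement of any regex engine, and `count_splits` is a hypothetical helper name. The assumption is that a backtracking matcher may, in the worst case, have to try a large share of these splits before it can give up.

```python
def count_splits(run_length, max_chunk=20):
    """Count the ordered ways to cut `run_length` characters into chunks of 1..max_chunk."""
    ways = [0] * (run_length + 1)
    ways[0] = 1  # the empty run can be cut in exactly one way
    for n in range(1, run_length + 1):
        # The last chunk has some length c; the rest is a shorter run.
        ways[n] = sum(ways[n - c] for c in range(1, min(max_chunk, n) + 1))
    return ways[run_length]


for n in (3, 10, 20, 21, 40):
    print(n, count_splits(n))
```

For runs of up to 20 characters the count doubles with every extra character, so even a short stretch of padding offers hundreds of thousands of alternative splits.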

An alternative approach that should match the desired would look like
this -- eliminating the alternation with quantifiers inside.

  / (?: \s{1,20} [^\s:;<]{1,80} ){80} /x    # spaces for readability

Since there is no alternation and the two char classes are distinct,
this RE can be simply expanded and matched from left to right, without
any ambiguity.
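
A minimal Python sketch of that expression follows, using re.VERBOSE as the counterpart of /x. Python's engine is not Perl's, so this only illustrates what the pattern accepts, not how fast SpamAssassin would run it. The sample strings are made up; the last two probe the chunk limits of 20 and 80, which were picked for the example.

```python
import re

ALTERNATIVE = re.compile(
    r"""
    (?:
        \s{1,20}          # one whitespace run, at most 20 characters
        [^\s:;<]{1,80}    # one run of other characters, at most 80
    ){80}
    """,
    re.VERBOSE,
)

eighty_words = " word" * 80               # 80 whitespace/word pairs
seventy_nine_words = " word" * 79         # one pair short
one_long_word = " " + "x" * 200           # no whitespace inside a long block
wide_gaps = (" " * 25 + "word") * 80      # every gap wider than 20

results = {
    "eighty_words": ALTERNATIVE.search(eighty_words) is not None,
    "seventy_nine_words": ALTERNATIVE.search(seventy_nine_words) is not None,
    "one_long_word": ALTERNATIVE.search(one_long_word) is not None,
    "wide_gaps": ALTERNATIVE.search(wide_gaps) is not None,
}
print(results)
```

Only the first sample matches. The last two show the trade-off of fixed chunk limits: a block of more than 80 non-whitespace characters, or gaps of more than 20 whitespace characters, break the chain.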


> I was wondering about that one, too. I'll take it back out. I'm thinking 
> of a better way to achieve that.

John, does the above example help? :)




Re: Slow masscheck

Posted by John Hardin <jh...@impsec.org>.
On Wed, 17 Feb 2010, Karsten Bräckelmann wrote:

>>> I'm seeing the same explosion in run times.
>>
>> Looking at commits between r909296 and r910179, I think the issue may be
>> with a new rule in jhardin/20_misc_testing.cf from r910157:
>>
>> +rawbody STYLE_GIBBERISH /<style[^>]{0,30}>(?:\s{1,20}|[^\s:;<]){175}/im
>
> Heh, I was just looking at the very same and bookmarked it for review
> tomorrow morning.
>
> However, it shouldn't be that bad -- since it is bound, and the
> alternatives are spaces or non-spaces. That should not lead to massive
> backtracking.

I was wondering about that one, too. I'll take it back out. I'm thinking 
of a better way to achieve that.


Re: Slow masscheck

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
> > I'm seeing the same explosion in run times.
> 
> Looking at commits between r909296 and r910179, I think the issue may be
> with a new rule in jhardin/20_misc_testing.cf from r910157:
> 
> +rawbody STYLE_GIBBERISH /<style[^>]{0,30}>(?:\s{1,20}|[^\s:;<]){175}/im

Heh, I was just looking at the very same and bookmarked it for review
tomorrow morning.

However, it shouldn't be that bad -- since it is bound, and the
alternatives are spaces or non-spaces. That should not lead to massive
backtracking.




Re: Slow masscheck

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 16/02/2010 6:21 PM, Daryl C. W. O'Shea wrote:
> On 16/02/2010 11:56 AM, Kevin Golding wrote:
>> It seems very few people have uploaded masscheck results the past couple
>> of days.  Given that I just noticed my usual ~15 minute run from
>> yesterday was still chugging along ~30 hours later (and this morning's
>> has been running for ~6 hours) I'm wondering if the lack of results from
>> others is for the same reason?
>>
>> Or in other words, did something slow the masscheck code considerably?
> 
> I'm seeing the same explosion in run times.

Looking at commits between r909296 and r910179, I think the issue may be
with a new rule in jhardin/20_misc_testing.cf from r910157:

+rawbody STYLE_GIBBERISH /<style[^>]{0,30}>(?:\s{1,20}|[^\s:;<]){175}/im

Daryl


Re: Slow masscheck

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 16/02/2010 11:56 AM, Kevin Golding wrote:
> It seems very few people have uploaded masscheck results the past couple
> of days.  Given that I just noticed my usual ~15 minute run from
> yesterday was still chugging along ~30 hours later (and this morning's
> has been running for ~6 hours) I'm wondering if the lack of results from
> others is for the same reason?
> 
> Or in other words, did something slow the masscheck code considerably?

I'm seeing the same explosion in run times.

Daryl